The CAZyme classifiers dbCAN, CUPP and eCAMI were independently evaluated against a high-quality benchmark test set. Performance was evaluated for CAZyme/non-CAZyme differentiation and for the multilabel classification of CAZy family annotations. This notebook contains the statistical evaluation of the CAZyme classifiers.
Results summary:
- dbCAN and DIAMOND showed the strongest performances in CAZyme/non-CAZyme differentiation
- dbCAN was the strongest performing tool across all categories; Hotpep (a tool invoked by dbCAN) was the weakest
- The performances of CUPP and eCAMI were similar, although CUPP showed a marginally better performance in the multilabel classification of CAZy family annotations
- The performance of dbCAN may be optimised by substituting Hotpep with CUPP and/or eCAMI

1 Introduction

The CAZyme classifiers dbCAN (Zhang et al. 2018), CUPP (Barrett and Lange, 2019) and eCAMI (Xu et al. 2019) use different methods to predict whether a protein is a CAZyme or a non-CAZyme, and to predict the CAZy family annotations of predicted CAZymes. These classifiers have not been independently evaluated against a high-quality benchmark test set.

This notebook lays out the independent evaluation of dbCAN, CUPP and eCAMI against a high-quality benchmark test set. The tools were evaluated on their ability to differentiate between CAZymes and non-CAZymes, and on their performance in predicting the CAZy family annotations of predicted CAZymes.

dbCAN incorporates the three protein function classifiers HMMER (Potter et al. 2018), Hotpep (Busk et al. 2017), and DIAMOND (Buchfink et al. 2015). In order to comprehensively evaluate the performance of dbCAN, the predictions from HMMER, Hotpep and DIAMOND were evaluated independently of each other, and the consensus prediction (a prediction which at least two of the tools agree upon) was defined as the dbCAN result.
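
The consensus rule described above can be illustrated with a minimal Python sketch (the notebook itself is written in R). The function names are hypothetical, and the family-level rule shown here (keep a family when at least two tools report it) is an assumption about how agreement is counted, not a description of dbCAN's internals.

```python
from collections import Counter


def consensus_is_cazyme(hmmer: bool, hotpep: bool, diamond: bool) -> bool:
    """Return True when at least two of the three tools call the protein a CAZyme."""
    return sum([hmmer, hotpep, diamond]) >= 2


def consensus_families(hmmer: set, hotpep: set, diamond: set) -> set:
    """Return the CAZy families reported by at least two of the three tools.

    Assumption: agreement is counted per family; the source does not say
    how family-level disagreements are resolved.
    """
    counts = Counter()
    for prediction in (hmmer, hotpep, diamond):
        counts.update(set(prediction))
    return {family for family, n in counts.items() if n >= 2}
```

For example, a protein annotated as GH5 by HMMER and Hotpep but as GT2 by DIAMOND would receive the consensus annotation GH5 under this rule.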

2 Test sets

A single test set of 100 CAZymes and the 100 non-CAZymes with the highest sequence similarity (rated by bit-score ratio) was created for each genomic assembly selected for inclusion in the benchmark test set. The 100 non-CAZymes with the highest sequence similarity were chosen to increase the probability of causing confusion, and so give a better idea of the performance to be expected when using the classifiers. An equal number of CAZymes and non-CAZymes was selected to prevent over-representation of one population over the other.

For a genomic assembly to be used to create a test set, the assembly had to meet all of the following criteria:

  • Contains at least 100 CAZymes
  • Contains at least 100 non-CAZymes
  • Has an ‘Assembly level’ of ‘Complete Genome’ in the NCBI Assembly database
  • Protein records are still present in NCBI
  • Not listed as an ‘Anomalous assembly’ in the NCBI Assembly database
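
The inclusion criteria above amount to a simple filter over candidate assemblies. The following Python sketch shows one way to express it; the `Assembly` record and its field names are hypothetical, and in practice the values would need to be gathered from CAZy and the NCBI Assembly database.

```python
from dataclasses import dataclass


@dataclass
class Assembly:
    # Hypothetical record; field names are illustrative only.
    accession: str
    n_cazymes: int
    n_non_cazymes: int
    assembly_level: str
    proteins_in_ncbi: bool
    anomalous: bool


def is_eligible(assembly: Assembly, minimum: int = 100) -> bool:
    """Return True when the assembly meets every inclusion criterion."""
    return (
        assembly.n_cazymes >= minimum
        and assembly.n_non_cazymes >= minimum
        and assembly.assembly_level == "Complete Genome"
        and assembly.proteins_in_ncbi
        and not assembly.anomalous
    )
```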

The genomic assemblies were also chosen from a range of taxonomies to provide an informative picture of the performance of the classifiers over the range of datasets that users may wish to analyse.

Table ?? contains the genomic assemblies used to create the test sets for the evaluation. In total, 81 assemblies were chosen: 1 from an Oomycete species (more Oomycete species with greater than 100 CAZymes in CAZy could not be found), 25 from fungal Ascomycete species, 13 from yeasts, 2 from eukaryotic microorganisms, 20 from Gram-positive bacteria and 20 from Gram-negative bacteria. Figure 2.1 presents the distribution of CAZome coverage across all 70 genomes.

## [1] "Mean percentage of genome incorporated in the CAZome across all test sets:"
## [1] 3.140472
## [1] "Standard deviation of the percentage of genome incorporated in the CAZome across all test sets:"
## [1] 1.174488
## [1] "Mean percentage of CAZomes incorporated in the test set across all genomes:"
## [1] 64.37203
## [1] "Standard deviation of the percentage of CAZome incorporated in the test set across all genomes:"
## [1] 25.54491

Figure 2.1: Histogram of CAZome coverage of the test sets for each respective source genomic assembly, overlaid by a box and whisker plot of the percentage of the CAZome incorporated in the test set.

3 CAZyme/non-CAZyme classification

The assignment of one or more CAZy family annotations to a protein by a CAZyme classifier identifies the protein as a CAZyme. If a classifier assigns no CAZy family annotations to a protein, the tool has identified the protein as a non-CAZyme. This notebook evaluates the performance of the CAZyme classifiers dbCAN (which incorporates HMMER, Hotpep and DIAMOND), CUPP and eCAMI for this binary CAZyme/non-CAZyme classification.
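
As a minimal illustration of this rule, the Python sketch below converts family annotations into binary CAZyme/non-CAZyme labels. The protein identifiers and the dictionary layout are hypothetical and do not reflect the output format of any of the tools.

```python
def is_predicted_cazyme(families) -> bool:
    """A protein is called a CAZyme when at least one CAZy family is assigned."""
    return len(families) > 0


# Hypothetical predictions: protein identifier -> assigned CAZy families
predictions = {
    "protein_1": ["GH5", "CBM1"],
    "protein_2": [],
}

# 1 = predicted CAZyme, 0 = predicted non-CAZyme
binary = {protein: int(is_predicted_cazyme(families))
          for protein, families in predictions.items()}
```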

3.1 Summary statistics

For every classifier-test set pair, the specificity, sensitivity, precision, F1-score and accuracy were calculated.

The mean of each statistical parameter was calculated for each classifier across all tests, to represent the overall performance of each CAZyme classifier.

These results are presented in table 3.1.
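
The per-test-set statistics and their means can be sketched in Python as follows. This is an illustration under stated assumptions rather than the notebook's own code: labels are encoded as 1 (CAZyme) and 0 (non-CAZyme), the function names are hypothetical, and a statistic with a zero denominator is reported as 0.0, a convention the source does not specify.

```python
from statistics import mean


def confusion_counts(truth, predicted):
    """Count true/false positives and negatives for 0/1 labels."""
    tp = sum(1 for t, p in zip(truth, predicted) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(truth, predicted) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(truth, predicted) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(truth, predicted) if t == 1 and p == 0)
    return tp, tn, fp, fn


def _ratio(numerator, denominator):
    # Assumption: an undefined ratio is reported as 0.0
    return numerator / denominator if denominator else 0.0


def binary_metrics(truth, predicted):
    """Return the five statistics for one classifier-test set pair."""
    tp, tn, fp, fn = confusion_counts(truth, predicted)
    sensitivity = _ratio(tp, tp + fn)
    precision = _ratio(tp, tp + fp)
    return {
        "specificity": _ratio(tn, tn + fp),
        "sensitivity": sensitivity,
        "precision": precision,
        "f1": _ratio(2 * precision * sensitivity, precision + sensitivity),
        "accuracy": _ratio(tp + tn, tp + tn + fp + fn),
    }


def mean_metrics(per_test_set):
    """Average each statistic across the test sets of one classifier."""
    return {name: mean(m[name] for m in per_test_set)
            for name in per_test_set[0]}
```

Calling `binary_metrics` once per test set and passing the resulting list to `mean_metrics` gives one row of mean values per classifier.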

Table 3.1: Overall performance of the CAZyme classifiers in differentiating between CAZymes and non-CAZymes (SD: standard deviation; CI: confidence interval)

| Classifier | Spec Mean | Spec SD | Spec CI Lower | Spec CI Upper | Sens Mean | Sens SD | Sens CI Lower | Sens CI Upper | Prec Mean | Prec SD | Prec CI Lower | Prec CI Upper | F1-score Mean | F1-score SD | F1-score CI Lower | F1-score CI Upper | Acc Mean | Acc SD | Acc CI Lower | Acc CI Upper |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CUPP | 0.9917 | 0.0156 | 0.9880 | 0.9954 | 0.8570 | 0.0825 | 0.8373 | 0.8767 | 0.9908 | 0.0172 | 0.9866 | 0.9949 | 0.9167 | 0.0531 | 0.9040 | 0.9293 | 0.9244 | 0.0417 | 0.9144 | 0.9343 |
| dbCAN | 0.9869 | 0.0245 | 0.9810 | 0.9927 | 0.9087 | 0.1123 | 0.8819 | 0.9355 | 0.9866 | 0.0241 | 0.9808 | 0.9923 | 0.9412 | 0.0796 | 0.9222 | 0.9602 | 0.9478 | 0.0564 | 0.9343 | 0.9612 |
| DIAMOND | 0.9844 | 0.0263 | 0.9782 | 0.9907 | 0.9261 | 0.1298 | 0.8952 | 0.9571 | 0.9847 | 0.0251 | 0.9787 | 0.9907 | 0.9481 | 0.0907 | 0.9264 | 0.9697 | 0.9553 | 0.0641 | 0.9400 | 0.9706 |
| eCAMI | 0.9836 | 0.0257 | 0.9774 | 0.9897 | 0.8610 | 0.1328 | 0.8293 | 0.8927 | 0.9826 | 0.0254 | 0.9765 | 0.9887 | 0.9112 | 0.0868 | 0.8905 | 0.9319 | 0.9223 | 0.0647 | 0.9069 | 0.9377 |
| HMMER | 0.9901 | 0.0163 | 0.9863 | 0.9940 | 0.8831 | 0.0835 | 0.8632 | 0.9030 | 0.9893 | 0.0174 | 0.9851 | 0.9935 | 0.9305 | 0.0613 | 0.9159 | 0.9451 | 0.9366 | 0.0422 | 0.9266 | 0.9467 |
| Hotpep | 0.9840 | 0.0257 | 0.9779 | 0.9901 | 0.8189 | 0.1327 | 0.7872 | 0.8505 | 0.9815 | 0.0287 | 0.9747 | 0.9884 | 0.8862 | 0.0917 | 0.8643 | 0.9081 | 0.9014 | 0.0666 | 0.8855 | 0.9173 |

Owing to the skewing of the data towards 1, the 95% confidence interval (CI) was calculated and plotted as error bars around the mean, as illustrated in figure 3.1.
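
The source does not state how the confidence interval was computed. The Python sketch below assumes one common choice, the mean plus and minus a normal-approximation critical value multiplied by the standard error, and should be read as an illustration only. An interval built this way is not bounded to the range 0 to 1, which would be consistent with upper limits slightly above 1 for statistics whose mean is very close to 1.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(values, level=0.95):
    """Return (mean, lower, upper) for a normal-approximation CI of the mean.

    Assumption: the interval is mean +/- z * sd / sqrt(n); a t-distribution
    based interval would be marginally wider for small n.
    """
    n = len(values)
    centre = mean(values)
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    half_width = z * stdev(values) / sqrt(n)
    return centre, centre - half_width, centre + half_width
```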

Figure 3.1: Summary statistics of the CAZyme classifiers' performances in binary CAZyme/non-CAZyme prediction, showing the mean plus and minus the 95% confidence interval.

3.2 Specificity

Specificity is the proportion of known negatives (known non-CAZymes) which are correctly classified as negatives (non-CAZymes).

Figure 3.2 is a graphical representation of the results calculated in table 3.1.

Figure 3.2: One-dimensional scatter plot of specificity scores of CAZyme and non-CAZyme predictions per test set, overlaying box plot of standard deviation.

3.3 Sensitivity

Sensitivity (also known as recall) is the proportion of known positives (CAZymes) that are correctly identified as positives (CAZymes).

Figure 3.3 graphically represents the results calculated in table 3.1.

Figure 3.3: One-dimensional scatter plot of recall (sensitivity) scores of CAZyme and non-CAZyme predictions per test set, overlaying box plot of standard deviation.

3.4 Precision

Precision is the proportion of positive predictions by the classifiers that are correct.

In this case, precision represents the fraction of CAZyme predictions by the classifiers that are correct, specifically the proportion of predicted CAZymes that are known CAZymes.

Figure 3.4 is a visual representation of the results calculated in table 3.1.

Figure 3.4: One-dimensional scatter plot of precision scores of CAZyme and non-CAZyme predictions per test set, overlaying box plot of standard deviation.

3.5 F1-score

The F1-score is the harmonic mean of recall and precision, calculated as 2 x (precision x recall) / (precision + recall), and provides an idea of the overall performance of the tool, with 0 being the worst and 1 being the best performance. Figure 3.5 shows the F1-score from each test set, for each classifier.

Figure 3.5: Bar chart of the F1-score of the CAZyme classifiers' differentiation between CAZymes and non-CAZymes.

3.6 Accuracy

Accuracy (calculated using (TP + TN) / (TP + TN + FP + FN), where TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives) provides an idea of the overall performance of the classifiers, as a measure of the degree to which their CAZyme/non-CAZyme predictions conform to the correct result. Figure 3.6 is a plot of the respective data from table 3.1.

Figure 3.6: Bar chart of the accuracy of the CAZyme classifiers' differentiation between CAZymes and non-CAZymes.

3.7 Expected Range of Accuracy

The statistics evaluated above provide an idea of the general performance of the tools, but they do not provide an idea of the expected range of performance. Specifically, the data do not provide a clear picture of the best and worst performance a user can expect when using these tools.

To compare the typical expected range of accuracies for each classifier, 6 test sets (identified by their source genomic assemblies) were selected at random. The CAZyme/non-CAZyme predictions for each classifier, for each test set, were bootstrap resampled 100 times, and the accuracy was calculated for each bootstrap sample. The accuracies of the bootstrap samples for each classifier were plotted as stacked histograms, shown in figure 3.7.
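
The bootstrap procedure can be sketched in Python as below. The sketch assumes that each bootstrap sample is drawn with replacement from the proteins of one test set and has the same size as that test set; the function name and the `seed` parameter are illustrative and not taken from the notebook.

```python
import random


def bootstrap_accuracies(truth, predicted, n_resamples=100, seed=None):
    """Resample (truth, prediction) pairs with replacement and score each sample."""
    rng = random.Random(seed)
    pairs = list(zip(truth, predicted))
    accuracies = []
    for _ in range(n_resamples):
        # Assumption: each bootstrap sample is the same size as the test set
        sample = rng.choices(pairs, k=len(pairs))
        correct = sum(1 for t, p in sample if t == p)
        accuracies.append(correct / len(sample))
    return accuracies
```

Running this once per classifier for a given test set yields the 100 accuracies per classifier that would be pooled into one histogram.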

Figure 3.7: Stacked histograms of bootstrap sample accuracies of the CAZyme classifiers' differentiation between CAZymes and non-CAZymes. 6 test sets (identified by their source genomic assembly) were selected at random. The CAZyme/non-CAZyme predictions for each classifier, for each test set, were bootstrap resampled 100 times. The accuracies of the 600 bootstrap samples per test set were plotted as a stacked histogram.

3.8 Conclusions on the Binary CAZyme/non-CAZyme Prediction Performance

Overall, all tools showed a low probability of producing false positives (misclassifying a non-CAZyme as a CAZyme), and few of the positive predictions were false positives. Therefore, we can be confident that the CAZyme predictions made by each of these tools are most likely correct. However, all the classifiers consistently failed to identify every CAZyme within a CAZome. We should therefore not presume that all non-CAZyme predictions are correct; these classifiers are unlikely to identify the complete CAZome, although a near-complete CAZome will be accurately identified.

dbCAN consistently demonstrated the strongest performance in all categories, implying that eCAMI and CUPP are not suitable replacements for this CAZyme classifier. Hotpep, which is incorporated within dbCAN, consistently demonstrated the weakest performance. Therefore, substituting eCAMI and/or CUPP for Hotpep within dbCAN may further improve the performance of dbCAN. The new k-mer based methods, eCAMI and CUPP, demonstrated similar performances. CUPP showed a more consistent performance, whereas eCAMI demonstrated a greater range in performance, although its mean performance was fractionally greater than that of CUPP. However, more bootstrap-calculated accuracy scores fell within the range of 0.9-1.0 for CUPP than for eCAMI. This implies that CUPP may typically provide a better performance than eCAMI, although eCAMI does have the potential on some occasions to outperform CUPP, depending on the test set.

4 CAZy Class classification

CAZy groups CAZymes into CAZy families by sequence similarity, and CAZy families are grouped into one of 6 functional classes. The CAZyme classifiers predict the CAZy family annotations of predicted CAZymes, but it is also of interest to see how well the classifiers perform at the CAZy class level. Specifically, a classifier may struggle to predict the correct CAZy family for a CAZyme but consistently predict the correct CAZy class. Therefore, the aim of this part of the evaluation is to assess how well the classifiers predict the correct CAZy class of predicted CAZymes.
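
Moving from family-level to class-level annotations only requires the class prefix of each family name. The Python sketch below assumes family names of the form class prefix followed by a number (for example GH5 or CBM50, optionally with a subfamily suffix such as GH13_1); the function names are hypothetical.

```python
import re

# The six CAZy classes evaluated in this notebook
CAZY_CLASSES = ("GH", "GT", "PL", "CE", "AA", "CBM")
_FAMILY_PATTERN = re.compile(r"^(GH|GT|PL|CE|AA|CBM)\d+")


def family_to_class(family):
    """Return the CAZy class prefix of a family name, or None if unrecognised."""
    match = _FAMILY_PATTERN.match(family)
    return match.group(1) if match else None


def classes_for(families):
    """Collapse a protein's family annotations to its set of CAZy classes."""
    return {c for c in map(family_to_class, families) if c is not None}
```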

4.2 Performance per CAZy class

Below the prediction sensitivity is plotted against the specificity for each classifier, and a separate plot is generated for each CAZy class.
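
Because a protein can belong to more than one CAZy class, the per-class statistics can be treated as a one-vs-rest problem: for a given class, a protein is a known positive if the class is among its true classes and a predicted positive if the class is among its predicted classes. The Python sketch below illustrates this under that assumption; reporting 0.0 when a denominator is zero is also an assumption.

```python
def class_sensitivity_specificity(true_classes, predicted_classes, cazy_class):
    """One-vs-rest sensitivity and specificity for a single CAZy class.

    true_classes and predicted_classes are parallel lists holding, for each
    protein, the set of CAZy classes it belongs to or was assigned.
    """
    tp = tn = fp = fn = 0
    for truth, predicted in zip(true_classes, predicted_classes):
        known = cazy_class in truth
        called = cazy_class in predicted
        if known and called:
            tp += 1
        elif known:
            fn += 1
        elif called:
            fp += 1
        else:
            tn += 1
    # Assumption: an undefined ratio is reported as 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity
```

Each test set then contributes one (specificity, sensitivity) point per classifier to the scatter plot for that class.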

The scatter plots of sensitivity against specificity are overlaid with a coloured contour to highlight the distribution of the points. When too many points have the same value, a contour cannot be generated, so noise is added to the data in order to plot the contour. The original data is used to plot the scatter plot and the data with added noise is used to plot the contour.

The percentage of data points that need noise added in order to generate a contour varies from data set to data set. To change it, change the third argument in the call to the function plot.class.sens.vs.spec(), which is used to generate the plots. The third argument is the proportion of data points to add noise to, written as a decimal.
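
The idea of adding noise to a chosen proportion of the points can be sketched in Python as follows. This is a hypothetical analogue and not the implementation of plot.class.sens.vs.spec(): the uniform noise, its `scale` and the function name are all assumptions made for illustration.

```python
import random


def add_noise_to_fraction(points, fraction, scale=0.01, seed=None):
    """Return a copy of (x, y) points with noise added to a fraction of them.

    Assumptions: noise is uniform in [-scale, scale] on both axes, and the
    points to perturb are chosen at random without replacement.
    """
    rng = random.Random(seed)
    n_noisy = round(len(points) * fraction)
    noisy_indices = set(rng.sample(range(len(points)), n_noisy))
    jittered = []
    for i, (x, y) in enumerate(points):
        if i in noisy_indices:
            x += rng.uniform(-scale, scale)
            y += rng.uniform(-scale, scale)
        jittered.append((x, y))
    return jittered
```

The untouched input would be used for the scatter plot and the returned copy for the contour; note that the noise can push values marginally outside the range 0 to 1.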

4.2.1 GH class classification

Figure 4.4: Scatter plot of sensitivity against specificity for predicting GH CAZy class members per CAZyme classifier, overlaying a density map.

Table 4.2: Overall performance of the CAZyme classifiers in classifying GH class members (SD: standard deviation; CI: confidence interval)

| Classifier | Spec Mean | Spec SD | Spec CI Lower | Spec CI Upper | Sens Mean | Sens SD | Sens CI Lower | Sens CI Upper | Prec Mean | Prec SD | Prec CI Lower | Prec CI Upper | F1-score Mean | F1-score SD | F1-score CI Lower | F1-score CI Upper | Acc Mean | Acc SD | Acc CI Lower | Acc CI Upper |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CUPP | 0.9955 | 0.0119 | 0.9927 | 0.9983 | 0.9136 | 0.0685 | 0.8973 | 0.9300 | 0.9957 | 0.0112 | 0.9930 | 0.9983 | 0.9514 | 0.0402 | 0.9418 | 0.9609 | 0.9581 | 0.0279 | 0.9514 | 0.9647 |
| dbCAN | 0.9960 | 0.0110 | 0.9934 | 0.9986 | 0.9209 | 0.1017 | 0.8966 | 0.9451 | 0.9947 | 0.0182 | 0.9903 | 0.9990 | 0.9527 | 0.0661 | 0.9370 | 0.9685 | 0.9625 | 0.0382 | 0.9534 | 0.9716 |
| DIAMOND | 0.9934 | 0.0141 | 0.9900 | 0.9967 | 0.9447 | 0.1054 | 0.9196 | 0.9699 | 0.9921 | 0.0224 | 0.9868 | 0.9975 | 0.9639 | 0.0689 | 0.9475 | 0.9803 | 0.9715 | 0.0413 | 0.9617 | 0.9814 |
| eCAMI | 0.9886 | 0.0205 | 0.9837 | 0.9935 | 0.8834 | 0.1104 | 0.8570 | 0.9097 | 0.9887 | 0.0214 | 0.9836 | 0.9938 | 0.9286 | 0.0660 | 0.9129 | 0.9444 | 0.9423 | 0.0446 | 0.9317 | 0.9529 |
| HMMER | 0.9957 | 0.0108 | 0.9931 | 0.9982 | 0.9151 | 0.0834 | 0.8952 | 0.9350 | 0.9944 | 0.0145 | 0.9909 | 0.9978 | 0.9506 | 0.0583 | 0.9367 | 0.9645 | 0.9585 | 0.0342 | 0.9503 | 0.9667 |
| Hotpep | 0.9853 | 0.0234 | 0.9797 | 0.9909 | 0.8825 | 0.1106 | 0.8562 | 0.9089 | 0.9842 | 0.0294 | 0.9772 | 0.9912 | 0.9263 | 0.0685 | 0.9100 | 0.9426 | 0.9403 | 0.0424 | 0.9302 | 0.9504 |

Figure 4.5: Summary statistics of CAZyme classifiers performances of GH class classification, plotting the mean plus and minus the 95% confidence interval.

Figure 4.6: One-dimensional scatter plot of the specificity per test set for the classification of GH class members, overlaying a box plot.

Figure 4.7: One-dimensional scatter plot of the sensitivity per test set for the classification of GH class members, overlaying a box plot.

Figure 4.8: One-dimensional scatter plot of the precision per test set for the classification of GH class members, overlaying a box plot.

Figure 4.9: One-dimensional scatter plot of the F1-score per test set for the classification of GH class members, overlaying a box plot.

Figure 4.10: One-dimensional scatter plot of the accuracy per test set for the classification of GH class members, overlaying a box plot.

4.2.2 GT class classification

Figure 4.11: Scatter plot of sensitivity against specificity for predicting GT CAZy class members per CAZyme classifier, overlaying a density map.

Table 4.3: Overall performance of the CAZyme classifiers in classifying GT class members (SD: standard deviation; CI: confidence interval)

| Classifier | Spec Mean | Spec SD | Spec CI Lower | Spec CI Upper | Sens Mean | Sens SD | Sens CI Lower | Sens CI Upper | Prec Mean | Prec SD | Prec CI Lower | Prec CI Upper | F1-score Mean | F1-score SD | F1-score CI Lower | F1-score CI Upper | Acc Mean | Acc SD | Acc CI Lower | Acc CI Upper |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CUPP | 0.9981 | 0.0078 | 0.9962 | 1.0000 | 0.8657 | 0.1188 | 0.8374 | 0.8940 | 0.9971 | 0.0115 | 0.9944 | 0.9999 | 0.9220 | 0.0759 | 0.9039 | 0.9401 | 0.9493 | 0.0581 | 0.9354 | 0.9632 |
| dbCAN | 0.9990 | 0.0065 | 0.9975 | 1.0006 | 0.8827 | 0.1393 | 0.8495 | 0.9159 | 0.9988 | 0.0080 | 0.9969 | 1.0007 | 0.9300 | 0.0983 | 0.9065 | 0.9534 | 0.9549 | 0.0727 | 0.9376 | 0.9722 |
| DIAMOND | 0.9977 | 0.0086 | 0.9956 | 0.9997 | 0.9314 | 0.1483 | 0.8961 | 0.9668 | 0.9968 | 0.0120 | 0.9940 | 0.9997 | 0.9550 | 0.1052 | 0.9299 | 0.9800 | 0.9702 | 0.0768 | 0.9519 | 0.9885 |
| eCAMI | 0.9980 | 0.0090 | 0.9958 | 1.0002 | 0.8529 | 0.1627 | 0.8141 | 0.8917 | 0.9978 | 0.0098 | 0.9954 | 1.0001 | 0.9101 | 0.1109 | 0.8837 | 0.9366 | 0.9417 | 0.0800 | 0.9226 | 0.9608 |
| HMMER | 0.9979 | 0.0096 | 0.9956 | 1.0002 | 0.8747 | 0.1080 | 0.8489 | 0.9005 | 0.9980 | 0.0092 | 0.9958 | 1.0002 | 0.9279 | 0.0768 | 0.9095 | 0.9462 | 0.9544 | 0.0532 | 0.9417 | 0.9671 |
| Hotpep | 0.9984 | 0.0070 | 0.9967 | 1.0001 | 0.7253 | 0.1889 | 0.6802 | 0.7703 | 0.9966 | 0.0132 | 0.9934 | 0.9997 | 0.8242 | 0.1433 | 0.7900 | 0.8584 | 0.8996 | 0.0899 | 0.8782 | 0.9210 |

Figure 4.12: Summary statistics of CAZyme classifiers performances of GT class classification, plotting the mean plus and minus the 95% confidence interval.

Figure 4.13: One-dimensional scatter plot of the specificity per test set for the classification of GT class members, overlaying a box plot.

Figure 4.14: One-dimensional scatter plot of the sensitivity per test set for the classification of GT class members, overlaying a box plot.

Figure 4.15: One-dimensional scatter plot of the precision per test set for the classification of GT class members, overlaying a box plot.

Figure 4.16: One-dimensional scatter plot of the F1-score per test set for the classification of GT class members, overlaying a box plot.

Figure 4.17: One-dimensional scatter plot of the accuracy per test set for the classification of GT class members, overlaying a box plot.

4.2.3 PL class classification

Figure 4.18: Scatter plot of sensitivity against specificity for predicting PL CAZy class members per CAZyme classifier, overlaying a density map.

Table 4.4: Overall performance of the CAZyme classifiers in classifying PL class members (SD: standard deviation; CI: confidence interval)

| Classifier | Spec Mean | Spec SD | Spec CI Lower | Spec CI Upper | Sens Mean | Sens SD | Sens CI Lower | Sens CI Upper | Prec Mean | Prec SD | Prec CI Lower | Prec CI Upper | F1-score Mean | F1-score SD | F1-score CI Lower | F1-score CI Upper | Acc Mean | Acc SD | Acc CI Lower | Acc CI Upper |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CUPP | 0.9992 | 0.0028 | 0.9983 | 1.0001 | 0.7751 | 0.3573 | 0.6576 | 0.8925 | 0.8493 | 0.3421 | 0.7369 | 0.9618 | 0.7957 | 0.3402 | 0.6839 | 0.9075 | 0.9919 | 0.0141 | 0.9872 | 0.9965 |
| dbCAN | 1.0000 | 0.0000 | 1.0000 | 1.0000 | 0.8600 | 0.2674 | 0.7721 | 0.9479 | 0.9474 | 0.2263 | 0.8730 | 1.0217 | 0.8911 | 0.2451 | 0.8105 | 0.9716 | 0.9950 | 0.0083 | 0.9923 | 0.9978 |
| DIAMOND | 0.9995 | 0.0023 | 0.9987 | 1.0002 | 0.8838 | 0.2641 | 0.7970 | 0.9706 | 0.9305 | 0.2375 | 0.8524 | 1.0085 | 0.8948 | 0.2479 | 0.8133 | 0.9763 | 0.9958 | 0.0072 | 0.9935 | 0.9982 |
| eCAMI | 0.9992 | 0.0028 | 0.9983 | 1.0001 | 0.7547 | 0.3215 | 0.6505 | 0.8589 | 0.8880 | 0.3069 | 0.7886 | 0.9875 | 0.8035 | 0.3049 | 0.7047 | 0.9023 | 0.9901 | 0.0154 | 0.9851 | 0.9951 |
| HMMER | 0.9995 | 0.0033 | 0.9984 | 1.0006 | 0.8884 | 0.2465 | 0.8074 | 0.9694 | 0.9342 | 0.2374 | 0.8562 | 1.0122 | 0.9061 | 0.2372 | 0.8281 | 0.9840 | 0.9955 | 0.0083 | 0.9928 | 0.9983 |
| Hotpep | 0.9985 | 0.0053 | 0.9968 | 1.0003 | 0.8213 | 0.2927 | 0.7251 | 0.9175 | 0.9089 | 0.2738 | 0.8189 | 0.9989 | 0.8534 | 0.2736 | 0.7635 | 0.9434 | 0.9917 | 0.0155 | 0.9866 | 0.9967 |

Figure 4.19: Summary statistics of CAZyme classifiers performances of PL class classification, plotting the mean plus and minus the 95% confidence interval.

Figure 4.20: One-dimensional scatter plot of the specificity per test set for the classification of PL class members, overlaying a box plot.

Figure 4.21: One-dimensional scatter plot of the sensitivity per test set for the classification of PL class members, overlaying a box plot.

Figure 4.22: One-dimensional scatter plot of the precision per test set for the classification of PL class members, overlaying a box plot.

Figure 4.23: One-dimensional scatter plot of the F1-score per test set for the classification of PL class members, overlaying a box plot.

Figure 4.24: One-dimensional scatter plot of the accuracy per test set for the classification of PL class members, overlaying a box plot.

4.2.4 CE class classification

Figure 4.25: Scatter plot of sensitivity against specificity for predicting CE CAZy class members per CAZyme classifier, overlaying a density map.

Table 4.5: Overall performance of the CAZyme classifiers in classifying CE class members (SD: standard deviation; CI: confidence interval)

| Classifier | Spec Mean | Spec SD | Spec CI Lower | Spec CI Upper | Sens Mean | Sens SD | Sens CI Lower | Sens CI Upper | Prec Mean | Prec SD | Prec CI Lower | Prec CI Upper | F1-score Mean | F1-score SD | F1-score CI Lower | F1-score CI Upper | Acc Mean | Acc SD | Acc CI Lower | Acc CI Upper |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CUPP | 0.9977 | 0.0085 | 0.9956 | 0.9997 | 0.9352 | 0.1498 | 0.8986 | 0.9717 | 0.9606 | 0.1455 | 0.9252 | 0.9961 | 0.9429 | 0.1383 | 0.9092 | 0.9766 | 0.9946 | 0.0095 | 0.9923 | 0.9969 |
| dbCAN | 0.9959 | 0.0161 | 0.9920 | 0.9998 | 0.9646 | 0.1464 | 0.9289 | 1.0003 | 0.9520 | 0.1664 | 0.9114 | 0.9926 | 0.9510 | 0.1507 | 0.9142 | 0.9877 | 0.9948 | 0.0155 | 0.9910 | 0.9986 |
| DIAMOND | 0.9958 | 0.0167 | 0.9917 | 0.9998 | 0.9174 | 0.2219 | 0.8632 | 0.9715 | 0.9361 | 0.2050 | 0.8861 | 0.9861 | 0.9128 | 0.2107 | 0.8614 | 0.9642 | 0.9925 | 0.0182 | 0.9880 | 0.9969 |
| eCAMI | 0.9941 | 0.0166 | 0.9901 | 0.9982 | 0.8396 | 0.2646 | 0.7751 | 0.9041 | 0.8992 | 0.2344 | 0.8421 | 0.9564 | 0.8490 | 0.2384 | 0.7909 | 0.9072 | 0.9885 | 0.0176 | 0.9842 | 0.9928 |
| HMMER | 0.9977 | 0.0081 | 0.9957 | 0.9996 | 0.9493 | 0.1129 | 0.9217 | 0.9768 | 0.9748 | 0.0772 | 0.9560 | 0.9936 | 0.9554 | 0.0794 | 0.9360 | 0.9748 | 0.9952 | 0.0089 | 0.9930 | 0.9973 |
| Hotpep | 0.9933 | 0.0173 | 0.9891 | 0.9975 | 0.8945 | 0.2320 | 0.8379 | 0.9511 | 0.8950 | 0.2385 | 0.8368 | 0.9532 | 0.8832 | 0.2235 | 0.8286 | 0.9377 | 0.9896 | 0.0176 | 0.9853 | 0.9939 |

Figure 4.26: Summary statistics of CAZyme classifiers performances of CE class classification, plotting the mean plus and minus the 95% confidence interval.

Figure 4.27: One-dimensional scatter plot of the specificity per test set for the classification of CE class members, overlaying a box plot.

Figure 4.28: One-dimensional scatter plot of the sensitivity per test set for the classification of CE class members, overlaying a box plot.

Figure 4.29: One-dimensional scatter plot of the precision per test set for the classification of CE class members, overlaying a box plot.

Figure 4.30: One-dimensional scatter plot of the F1-score per test set for the classification of CE class members, overlaying a box plot.

Figure 4.31: One-dimensional scatter plot of the accuracy per test set for the classification of CE class members, overlaying a box plot.

4.2.5 AA class classification

Figure 4.32: Scatter plot of sensitivity against specificity for predicting AA CAZy class members per CAZyme classifier, overlaying a density map.

Table 4.6: Overall performance of the CAZyme classifiers in classifying AA class members (SD: standard deviation; CI: confidence interval)

| Classifier | Spec Mean | Spec SD | Spec CI Lower | Spec CI Upper | Sens Mean | Sens SD | Sens CI Lower | Sens CI Upper | Prec Mean | Prec SD | Prec CI Lower | Prec CI Upper | F1-score Mean | F1-score SD | F1-score CI Lower | F1-score CI Upper | Acc Mean | Acc SD | Acc CI Lower | Acc CI Upper |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CUPP | 0.9930 | 0.0187 | 0.9868 | 0.9993 | 0.9165 | 0.1213 | 0.8760 | 0.9569 | 0.9383 | 0.1497 | 0.8884 | 0.9882 | 0.9169 | 0.1226 | 0.8760 | 0.9578 | 0.9862 | 0.0248 | 0.9779 | 0.9945 |
| dbCAN | 0.9930 | 0.0196 | 0.9865 | 0.9996 | 0.9372 | 0.1147 | 0.8989 | 0.9754 | 0.9390 | 0.1492 | 0.8892 | 0.9887 | 0.9294 | 0.1241 | 0.8881 | 0.9708 | 0.9886 | 0.0251 | 0.9803 | 0.9970 |
| DIAMOND | 0.9930 | 0.0194 | 0.9866 | 0.9995 | 0.8796 | 0.2475 | 0.7971 | 0.9622 | 0.9143 | 0.2099 | 0.8443 | 0.9843 | 0.8743 | 0.2267 | 0.7987 | 0.9499 | 0.9872 | 0.0225 | 0.9797 | 0.9947 |
| eCAMI | 0.9936 | 0.0169 | 0.9880 | 0.9992 | 0.8422 | 0.1926 | 0.7780 | 0.9064 | 0.9374 | 0.1505 | 0.8872 | 0.9876 | 0.8679 | 0.1556 | 0.8160 | 0.9198 | 0.9818 | 0.0273 | 0.9727 | 0.9909 |
| HMMER | 0.9925 | 0.0187 | 0.9862 | 0.9987 | 0.9671 | 0.0673 | 0.9447 | 0.9896 | 0.9345 | 0.1462 | 0.8857 | 0.9832 | 0.9429 | 0.1066 | 0.9073 | 0.9784 | 0.9891 | 0.0218 | 0.9818 | 0.9963 |
| Hotpep | 0.9928 | 0.0201 | 0.9861 | 0.9995 | 0.9225 | 0.1319 | 0.8785 | 0.9664 | 0.9370 | 0.1536 | 0.8858 | 0.9883 | 0.9190 | 0.1311 | 0.8753 | 0.9627 | 0.9873 | 0.0256 | 0.9788 | 0.9958 |

Figure 4.33: Summary statistics of CAZyme classifiers performances of AA class classification, plotting the mean plus and minus the 95% confidence interval.

Figure 4.34: One-dimensional scatter plot of the specificity per test set for the classification of AA class members, overlaying a box plot.

Figure 4.35: One-dimensional scatter plot of the sensitivity per test set for the classification of AA class members, overlaying a box plot.

Figure 4.36: One-dimensional scatter plot of the precision per test set for the classification of AA class members, overlaying a box plot.

Figure 4.37: One-dimensional scatter plot of the F1-score per test set for the classification of AA class members, overlaying a box plot.

Figure 4.38: One-dimensional scatter plot of the accuracy per test set for the classification of AA class members, overlaying a box plot.

4.2.6 CBM class classification

Figure 4.39: Scatter plot of sensitivity against specificity for predicting CBM CAZy class members per CAZyme classifier, overlaying a density map.

Table 4.7: Overall performance of the CAZyme classifiers in classifying CBM class members (SD: standard deviation; CI: confidence interval)

| Classifier | Spec Mean | Spec SD | Spec CI Lower | Spec CI Upper | Sens Mean | Sens SD | Sens CI Lower | Sens CI Upper | Prec Mean | Prec SD | Prec CI Lower | Prec CI Upper | F1-score Mean | F1-score SD | F1-score CI Lower | F1-score CI Upper | Acc Mean | Acc SD | Acc CI Lower | Acc CI Upper |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CUPP | 1.0000 | 0.0000 | 1.0000 | 1.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.8852 | 0.0898 | 0.8638 | 0.9066 |
| dbCAN | 0.9937 | 0.0103 | 0.9912 | 0.9962 | 0.8007 | 0.1975 | 0.7536 | 0.8478 | 0.9254 | 0.1243 | 0.8958 | 0.9551 | 0.8433 | 0.1547 | 0.8064 | 0.8802 | 0.9729 | 0.0272 | 0.9664 | 0.9794 |
| DIAMOND | 0.9947 | 0.0101 | 0.9923 | 0.9971 | 0.8659 | 0.2031 | 0.8175 | 0.9143 | 0.9429 | 0.1395 | 0.9097 | 0.9762 | 0.8924 | 0.1669 | 0.8526 | 0.9322 | 0.9820 | 0.0235 | 0.9764 | 0.9876 |
| eCAMI | 0.9482 | 0.0513 | 0.9359 | 0.9604 | 0.8116 | 0.2202 | 0.7591 | 0.8641 | 0.6838 | 0.1874 | 0.6391 | 0.7285 | 0.7218 | 0.1762 | 0.6798 | 0.7638 | 0.9354 | 0.0512 | 0.9232 | 0.9476 |
| HMMER | 0.9960 | 0.0082 | 0.9941 | 0.9980 | 0.4666 | 0.2448 | 0.4082 | 0.5250 | 0.9069 | 0.2101 | 0.8568 | 0.9570 | 0.5792 | 0.2200 | 0.5267 | 0.6316 | 0.9420 | 0.0332 | 0.9341 | 0.9499 |
| Hotpep | 0.9013 | 0.0565 | 0.8878 | 0.9148 | 0.7851 | 0.2226 | 0.7320 | 0.8381 | 0.4862 | 0.1634 | 0.4473 | 0.5252 | 0.5823 | 0.1646 | 0.5430 | 0.6215 | 0.8898 | 0.0560 | 0.8765 | 0.9032 |

Figure 4.40: Summary statistics of CAZyme classifiers performances of CBM class classification, plotting the mean plus and minus the 95% confidence interval.

Figure 4.41: One dimensional scatter plot of the specificity per test set for the classification of CBM class members, overlaying a box plot

Figure 4.42: One dimensional scatter plot of the sensitivity per test set for the classification of CBM class members, overlaying a box plot

Figure 4.43: One dimensional scatter plot of the precision per test set for the classification of CBM class members, overlaying a box plot

Figure 4.44: One dimensional scatter plot of the F1-score per test set for the classification of CBM class members, overlaying a box plot

Figure 4.45: One dimensional scatter plot of the accuracy per test set for the classification of CBM class members, overlaying a box plot

4.3 Rand Index and Adjusted Rand Index of CAZy Class Prediction

A single CAZyme can belong to multiple CAZy classes, which makes CAZy class prediction a multilabel classification problem. To evaluate the multilabel classification of CAZy classes, the Rand Index (RI) and Adjusted Rand Index (ARI) were calculated.

The RI is a measure of accuracy across all potential classifications of a protein, and ranges from 0 (no correct annotations) to 1 (all annotations correct). The ARI is the RI adjusted for chance: a score of 1 means all annotations are correct, 0 is equivalent to assigning the CAZy class annotations randomly, and -1 means the annotations are systematically assigned incorrectly.
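One way to prepare a protein for RI and ARI scoring is to encode its CAZy class annotations as a 0/1 indicator vector with one position per class. The sketch below is illustrative only: the class order, the function name `encode_classes` and the example protein are assumptions, not taken from the notebook.

```python
# Assumed fixed order of the six CAZy classes; any consistent order works.
CAZY_CLASSES = ("GH", "GT", "PL", "CE", "AA", "CBM")


def encode_classes(annotations):
    """Return a 0/1 indicator per CAZy class for one protein."""
    unknown = set(annotations) - set(CAZY_CLASSES)
    if unknown:
        raise ValueError(f"Unknown CAZy classes: {sorted(unknown)}")
    return [int(cls in annotations) for cls in CAZY_CLASSES]


# Hypothetical protein: ground truth has a GH domain and a CBM,
# while the classifier only predicted the GH domain.
truth = encode_classes({"GH", "CBM"})
predicted = encode_classes({"GH"})
print(truth)      # [1, 0, 0, 0, 0, 1]
print(predicted)  # [1, 0, 0, 0, 0, 0]
```

The ground truth and predicted vectors for the same protein can then be compared position by position.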

Table 4.8: Adjusted Rand Index of CAZyme classifier classification of CAZy class annotations
Prediction_tool Lower CI Mean Upper CI Standard Deviation
dbCAN 0.9359 0.9398 0.9437 0.2359
HMMER 0.9226 0.9268 0.9310 0.2537
DIAMOND 0.9510 0.9545 0.9579 0.2079
Hotpep 0.8653 0.8706 0.8759 0.3212
CUPP 0.8960 0.9007 0.9054 0.2852
eCAMI 0.9013 0.9060 0.9107 0.2836

The plots below are violin plots underlying scatter plots, presenting the RI and ARI for every protein across all test sets.

Figure 4.46: Violin plot of Rand Index (RI) of performance of the CAZyme classifiers to predict the multilabel classification of CAZy classes.

Figure 4.47: 95% confidence interval around the mean of Rand Index (RI) of the performance of the CAZyme classifiers to predict the multilabel classification of CAZy classes.

Figure 4.48: Violin plot of Adjusted Rand Index (ARI) of performance of the CAZyme classifiers to predict the multilabel classification of CAZy classes.

Figure 4.49: 95% confidence interval around the mean of Adjusted Rand Index (ARI) of the performance of the CAZyme classifiers to predict the multilabel classification of CAZy classes.

5 CAZy family classification

The following section evaluates the performance of the CAZyme classifiers to predict CAZy family classifications.

5.2 Performance per CAZy family

To evaluate the performance of predicting each CAZy family independently of all other CAZy families, the sensitivity and specificity of each CAZyme classifier were calculated for each CAZy family and plotted against each other (Fig.??). For CAZy classes, sensitivity was plotted directly against specificity. For CAZy families, owing to the extremely small variation in specificity scores, sensitivity was plotted as a percentage against log10 of the specificity percentage.

The following plots present the specificity (Fig.5.2), sensitivity (Fig.5.3), precision (Fig.5.4), F1-score (Fig.5.5) and accuracy (Fig.5.6) for each CAZy family per classifier. Each plot is accompanied by a table summarising the mean value of the statistic for each classifier across all CAZy families in each CAZy class.
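The five statistics reported per CAZy family all follow from the confusion counts of one family for one classifier. The sketch below uses the standard definitions. The function name `family_metrics` and the example counts are hypothetical, and returning NaN for an undefined ratio is an assumption (the notebook may handle that case differently).

```python
def family_metrics(tp, fp, tn, fn):
    """Compute binary classification statistics for one CAZy family.

    tp/fp/tn/fn are the counts of true positives, false positives,
    true negatives and false negatives for that family.
    """
    def ratio(numerator, denominator):
        # Assumption: an undefined ratio (zero denominator) is reported as NaN.
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
    }


# Hypothetical counts for one family in a pool of 200 proteins.
scores = family_metrics(tp=8, fp=2, tn=188, fn=2)
print(scores)
```

Because most proteins are true negatives for any single family, specificity and accuracy sit very close to 1 even when sensitivity is modest, which is the pattern seen in Tables 5.3 and 5.7.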

5.2.1 Specificity

Table 5.3: Specificity of CAZy family classification per CAZyme classifier
CAZy_class Prediction_tool Mean Standard Deviation Lower CI Upper CI
CBM dbCAN 0.9999 0.0003 0.9998 1.0000
CBM HMMER 0.9999 0.0004 0.9998 1.0000
CBM DIAMOND 0.9999 0.0002 0.9998 1.0000
CBM Hotpep 0.9974 0.0038 0.9965 0.9984
CBM CUPP 1.0000 0.0000 1.0000 1.0000
CBM eCAMI 0.9989 0.0017 0.9985 0.9994
AA dbCAN 0.9997 0.0006 0.9994 1.0001
AA HMMER 0.9997 0.0006 0.9993 1.0000
AA DIAMOND 0.9997 0.0007 0.9993 1.0001
AA Hotpep 0.9997 0.0006 0.9994 1.0001
AA CUPP 0.9997 0.0006 0.9994 1.0001
AA eCAMI 0.9998 0.0005 0.9995 1.0001
CE dbCAN 0.9997 0.0007 0.9993 1.0000
CE HMMER 0.9998 0.0003 0.9996 0.9999
CE DIAMOND 0.9997 0.0007 0.9993 1.0001
CE Hotpep 0.9995 0.0007 0.9992 0.9999
CE CUPP 0.9998 0.0004 0.9996 1.0000
CE eCAMI 0.9996 0.0007 0.9992 1.0000
PL dbCAN 1.0000 0.0000 1.0000 1.0000
PL HMMER 1.0000 0.0001 0.9999 1.0000
PL DIAMOND 1.0000 0.0001 1.0000 1.0000
PL Hotpep 1.0000 0.0001 0.9999 1.0000
PL CUPP 1.0000 0.0001 0.9999 1.0000
PL eCAMI 1.0000 0.0001 0.9999 1.0000
GT dbCAN 1.0000 0.0001 1.0000 1.0000
GT HMMER 0.9999 0.0002 0.9999 1.0000
GT DIAMOND 1.0000 0.0001 0.9999 1.0000
GT Hotpep 1.0000 0.0002 0.9999 1.0000
GT CUPP 1.0000 0.0001 1.0000 1.0000
GT eCAMI 1.0000 0.0002 0.9999 1.0000
GH dbCAN 1.0000 0.0001 0.9999 1.0000
GH HMMER 1.0000 0.0002 0.9999 1.0000
GH DIAMOND 1.0000 0.0001 0.9999 1.0000
GH Hotpep 0.9998 0.0006 0.9998 0.9999
GH CUPP 1.0000 0.0001 1.0000 1.0000
GH eCAMI 0.9999 0.0003 0.9998 1.0000

Figure 5.2: Scatter plot overlaying a one-dimensional box-and-whisker plot of specificity for each CAZy family for each CAZyme classifier. Each CAZy family is represented as a single point on the plot.

5.2.2 Sensitivity

Table 5.4: Sensitivity of CAZy family classification per CAZyme classifier
CAZy_class Prediction_tool Mean Standard Deviation Lower CI Upper CI
CBM dbCAN 0.8945 0.2152 0.8333 0.9556
CBM HMMER 0.7766 0.3606 0.6752 0.8781
CBM DIAMOND 0.9052 0.2201 0.8427 0.9678
CBM Hotpep 0.6006 0.4159 0.4975 0.7037
CBM CUPP 0.0000 0.0000 0.0000 0.0000
CBM eCAMI 0.6069 0.3853 0.5082 0.7056
AA dbCAN 0.8132 0.2928 0.6442 0.9822
AA HMMER 0.8899 0.2706 0.7336 1.0461
AA DIAMOND 0.8040 0.2939 0.6343 0.9737
AA Hotpep 0.8159 0.2935 0.6464 0.9854
AA CUPP 0.7194 0.4076 0.4841 0.9547
AA eCAMI 0.6972 0.3735 0.4816 0.9129
CE dbCAN 0.8724 0.2887 0.7186 1.0262
CE HMMER 0.9244 0.2487 0.7919 1.0569
CE DIAMOND 0.8481 0.2655 0.7067 0.9896
CE Hotpep 0.7921 0.3132 0.6252 0.9589
CE CUPP 0.8504 0.3356 0.6716 1.0292
CE eCAMI 0.7749 0.2659 0.6332 0.9165
PL dbCAN 0.8076 0.3628 0.6468 0.9685
PL HMMER 0.8571 0.3137 0.7180 0.9962
PL DIAMOND 0.8287 0.3489 0.6740 0.9834
PL Hotpep 0.6768 0.3560 0.5189 0.8346
PL CUPP 0.6055 0.4310 0.4144 0.7966
PL eCAMI 0.6159 0.4221 0.4288 0.8031
GT dbCAN 0.8586 0.2695 0.7943 0.9228
GT HMMER 0.8397 0.3045 0.7682 0.9113
GT DIAMOND 0.8729 0.2621 0.8104 0.9354
GT Hotpep 0.7537 0.3274 0.6757 0.8318
GT CUPP 0.8073 0.3050 0.7345 0.8800
GT eCAMI 0.7635 0.3165 0.6880 0.8389
GH dbCAN 0.9252 0.1885 0.8917 0.9588
GH HMMER 0.9188 0.2165 0.8806 0.9570
GH DIAMOND 0.9258 0.1923 0.8916 0.9600
GH Hotpep 0.8558 0.2556 0.8106 0.9011
GH CUPP 0.8172 0.3475 0.7554 0.8789
GH eCAMI 0.8036 0.3017 0.7499 0.8572

Figure 5.3: Scatter plot overlaying a one-dimensional box-and-whisker plot of sensitivity for each CAZy family for each CAZyme classifier. Each CAZy family is represented as a single point on the plot.

5.2.3 Precision

Table 5.5: Precision of CAZy family classification per CAZyme classifier
CAZy_class Prediction_tool Mean Standard Deviation Lower CI Upper CI
CBM dbCAN 0.9012 0.2358 0.8341 0.9682
CBM HMMER 0.8416 0.3438 0.7449 0.9383
CBM DIAMOND 0.9042 0.2194 0.8418 0.9665
CBM Hotpep 0.2739 0.2989 0.1999 0.3480
CBM CUPP 0.0000 0.0000 0.0000 0.0000
CBM eCAMI 0.4427 0.3508 0.3528 0.5325
AA dbCAN 0.9149 0.1534 0.8263 1.0035
AA HMMER 0.8225 0.2862 0.6572 0.9877
AA DIAMOND 0.8832 0.1772 0.7809 0.9855
AA Hotpep 0.9156 0.1526 0.8274 1.0037
AA CUPP 0.7480 0.3840 0.5263 0.9698
AA eCAMI 0.7846 0.3565 0.5787 0.9904
CE dbCAN 0.8256 0.3191 0.6556 0.9957
CE HMMER 0.8026 0.2910 0.6475 0.9576
CE DIAMOND 0.8379 0.3034 0.6763 0.9996
CE Hotpep 0.7979 0.3115 0.6320 0.9639
CE CUPP 0.8336 0.3361 0.6545 1.0127
CE eCAMI 0.8144 0.3100 0.6492 0.9796
PL dbCAN 0.8636 0.3513 0.7079 1.0194
PL HMMER 0.8538 0.3256 0.7094 0.9982
PL DIAMOND 0.8506 0.3513 0.6949 1.0064
PL Hotpep 0.8628 0.3509 0.7072 1.0184
PL CUPP 0.7154 0.4494 0.5162 0.9147
PL eCAMI 0.7240 0.4539 0.5227 0.9252
GT dbCAN 0.9418 0.2336 0.8861 0.9975
GT HMMER 0.8765 0.3003 0.8059 0.9470
GT DIAMOND 0.9400 0.2335 0.8843 0.9957
GT Hotpep 0.8917 0.3018 0.8197 0.9636
GT CUPP 0.9054 0.2815 0.8383 0.9726
GT eCAMI 0.8950 0.3014 0.8232 0.9669
GH dbCAN 0.9639 0.1776 0.9324 0.9955
GH HMMER 0.9331 0.2176 0.8947 0.9714
GH DIAMOND 0.9585 0.1823 0.9261 0.9909
GH Hotpep 0.9138 0.2502 0.8695 0.9581
GH CUPP 0.8524 0.3452 0.7911 0.9138
GH eCAMI 0.8837 0.2975 0.8308 0.9366

Figure 5.4: Scatter plot overlaying a one-dimensional box-and-whisker plot of precision for each CAZy family for each CAZyme classifier. Each CAZy family is represented as a single point on the plot.

5.2.4 F1-score

Table 5.6: F1-score of CAZy family classification per CAZyme classifier
CAZy_class Prediction_tool Mean Standard Deviation Lower CI Upper CI
CBM dbCAN 0.8863 0.2176 0.8245 0.9482
CBM HMMER 0.7755 0.3559 0.6754 0.8756
CBM DIAMOND 0.8980 0.2136 0.8373 0.9587
CBM Hotpep 0.3402 0.3115 0.2630 0.4174
CBM CUPP 0.0000 0.0000 0.0000 0.0000
CBM eCAMI 0.4819 0.3354 0.3960 0.5678
AA dbCAN 0.8173 0.2539 0.6707 0.9639
AA HMMER 0.8479 0.2701 0.6919 1.0038
AA DIAMOND 0.8119 0.2539 0.6653 0.9584
AA Hotpep 0.8187 0.2537 0.6723 0.9652
AA CUPP 0.7123 0.3913 0.4864 0.9383
AA eCAMI 0.7102 0.3681 0.4976 0.9227
CE dbCAN 0.8408 0.2993 0.6813 1.0003
CE HMMER 0.8443 0.2639 0.7036 0.9849
CE DIAMOND 0.8357 0.2800 0.6865 0.9849
CE Hotpep 0.7720 0.2980 0.6132 0.9307
CE CUPP 0.8388 0.3310 0.6625 1.0152
CE eCAMI 0.7791 0.2716 0.6344 0.9238
PL dbCAN 0.8263 0.3540 0.6693 0.9832
PL HMMER 0.8363 0.3085 0.6995 0.9731
PL DIAMOND 0.8372 0.3471 0.6832 0.9911
PL Hotpep 0.7390 0.3413 0.5877 0.8903
PL CUPP 0.6396 0.4241 0.4515 0.8276
PL eCAMI 0.6549 0.4277 0.4652 0.8445
GT dbCAN 0.8869 0.2587 0.8252 0.9486
GT HMMER 0.8502 0.2961 0.7806 0.9198
GT DIAMOND 0.8958 0.2563 0.8347 0.9569
GT Hotpep 0.7995 0.3111 0.7253 0.8736
GT CUPP 0.8411 0.2886 0.7723 0.9099
GT eCAMI 0.8110 0.3055 0.7382 0.8838
GH dbCAN 0.9422 0.1807 0.9101 0.9743
GH HMMER 0.9169 0.2164 0.8787 0.9550
GH DIAMOND 0.9386 0.1840 0.9059 0.9713
GH Hotpep 0.8782 0.2473 0.8344 0.9219
GH CUPP 0.8280 0.3454 0.7666 0.8894
GH eCAMI 0.8333 0.2931 0.7812 0.8854

Figure 5.5: Scatter plot overlaying a one-dimensional box-and-whisker plot of the F1-score for each CAZy family for each CAZyme classifier. Each CAZy family is represented as a single point on the plot.

5.2.5 Accuracy

Table 5.7: Accuracy of CAZy family classification per CAZyme classifier
CAZy_class Prediction_tool Mean Standard Deviation Lower CI Upper CI
CBM dbCAN 0.9994 0.0016 0.9990 0.9999
CBM HMMER 0.9988 0.0036 0.9978 0.9998
CBM DIAMOND 0.9996 0.0009 0.9994 0.9999
CBM Hotpep 0.9970 0.0042 0.9960 0.9981
CBM CUPP 0.9976 0.0043 0.9964 0.9988
CBM eCAMI 0.9985 0.0022 0.9979 0.9991
AA dbCAN 0.9995 0.0008 0.9990 0.9999
AA HMMER 0.9994 0.0009 0.9989 1.0000
AA DIAMOND 0.9994 0.0008 0.9989 0.9999
AA Hotpep 0.9995 0.0007 0.9991 0.9999
AA CUPP 0.9994 0.0007 0.9990 0.9998
AA eCAMI 0.9993 0.0008 0.9988 0.9998
CE dbCAN 0.9996 0.0009 0.9990 1.0001
CE HMMER 0.9995 0.0007 0.9991 0.9999
CE DIAMOND 0.9994 0.0010 0.9989 0.9999
CE Hotpep 0.9993 0.0010 0.9987 0.9998
CE CUPP 0.9996 0.0006 0.9993 0.9999
CE eCAMI 0.9992 0.0011 0.9986 0.9998
PL dbCAN 0.9999 0.0002 0.9998 1.0000
PL HMMER 0.9999 0.0003 0.9997 1.0000
PL DIAMOND 0.9999 0.0002 0.9998 1.0000
PL Hotpep 0.9998 0.0003 0.9997 0.9999
PL CUPP 0.9998 0.0003 0.9997 0.9999
PL eCAMI 0.9997 0.0004 0.9996 0.9999
GT dbCAN 0.9993 0.0019 0.9989 0.9998
GT HMMER 0.9992 0.0023 0.9987 0.9998
GT DIAMOND 0.9995 0.0010 0.9993 0.9998
GT Hotpep 0.9985 0.0052 0.9973 0.9998
GT CUPP 0.9992 0.0018 0.9988 0.9997
GT eCAMI 0.9991 0.0020 0.9986 0.9996
GH dbCAN 0.9997 0.0011 0.9995 0.9999
GH HMMER 0.9996 0.0015 0.9993 0.9999
GH DIAMOND 0.9998 0.0006 0.9996 0.9999
GH Hotpep 0.9994 0.0015 0.9991 0.9997
GH CUPP 0.9996 0.0013 0.9994 0.9999
GH eCAMI 0.9994 0.0011 0.9993 0.9996

Figure 5.6: Scatter plot overlaying a one-dimensional box-and-whisker plot of the accuracy for each CAZy family for each CAZyme classifier. Each CAZy family is represented as a single point on the plot.

5.2.6 Plotting sensitivity against specificity

For better resolution we can group the CAZy families by their parent CAZy class, and compare the performances of the tools one CAZy class at a time. Owing to the minimal variation in specificity scores, specificity was plotted as log10 of the percentage specificity.
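The log10 transformation of the percentage specificity can be sketched as follows. The function name `log10_percent` is hypothetical, and the sketch assumes specificity is stored as a proportion between 0 and 1, as in Table 5.3.

```python
import math


def log10_percent(score):
    """Convert a proportion in (0, 1] to log10 of its percentage."""
    if not 0 < score <= 1:
        raise ValueError("score must be greater than 0 and at most 1")
    return math.log10(score * 100)


# A perfect specificity maps to log10(100) = 2.
print(log10_percent(1.0))
# A specificity of 0.9974 maps to a value just below 2.
print(log10_percent(0.9974))
```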

5.2.7 Glycoside Hydrolases

Figure 5.7 plots sensitivity against specificity for each Glycoside Hydrolase CAZy family.

Figure 5.7: Scatter plot of recall (sensitivity) against specificity for predicting each CAZy family for each CAZyme classifier in the CAZy class Glycoside Hydrolases. Each GH CAZy family is represented as a single point on the plot.

5.2.8 Glycosyltransferases

Figure 5.8 plots sensitivity against specificity for each Glycosyltransferase CAZy family.

Figure 5.8: Scatter plot of recall (sensitivity) against specificity for predicting each CAZy family for each CAZyme classifier in the CAZy class Glycosyltransferases. Each GT CAZy family is represented as a single point on the plot.

5.2.9 Polysaccharide Lyases

Figure 5.9 plots sensitivity against specificity for each Polysaccharide Lyase CAZy family.

Figure 5.9: Scatter plot of recall (sensitivity) against specificity for predicting each CAZy family for each CAZyme classifier in the CAZy class Polysaccharide Lyases. Each PL CAZy family is represented as a single point on the plot.

5.2.10 Carbohydrate Esterases

Figure 5.10 plots sensitivity against specificity for each Carbohydrate Esterase CAZy family.

Figure 5.10: Scatter plot of recall (sensitivity) against specificity for predicting each CAZy family for each CAZyme classifier in the CAZy class Carbohydrate Esterases. Each CE CAZy family is represented as a single point on the plot.

5.2.11 Auxiliary Activities

Figure ?? plots sensitivity against specificity for each Auxiliary Activities CAZy family.

Figure 5.11: Scatter plot of recall (sensitivity) against specificity for predicting each CAZy family for each CAZyme classifier in the CAZy class Auxiliary Activities. Each AA CAZy family is represented as a single point on the plot.

5.2.12 Carbohydrate Binding Modules

Figure 5.12 plots sensitivity against specificity for each Carbohydrate Binding Module CAZy family.

Figure 5.12: Scatter plot of recall (sensitivity) against specificity for predicting each CAZy family for each CAZyme classifier in the CAZy class Carbohydrate Binding Modules. Each CBM CAZy family is represented as a single point on the plot.

5.3 Consistently poor performing CAZy families

We then identified the CAZy families for which at least three classifiers produced a sensitivity score of less than 0.75.
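The selection rule can be sketched as a simple filter over per-family sensitivity scores. The function name `hard_families`, the nested dictionary layout and the family names and scores in the example are all hypothetical; only the threshold (0.75) and the minimum number of classifiers (three) come from the text.

```python
def hard_families(sensitivity, threshold=0.75, min_classifiers=3):
    """Return families where enough classifiers fall below the threshold.

    sensitivity maps each family name to a {classifier: score} dictionary.
    """
    selected = []
    for family, scores in sensitivity.items():
        n_low = sum(1 for score in scores.values() if score < threshold)
        if n_low >= min_classifiers:
            selected.append(family)
    return sorted(selected)


# Hypothetical scores for two made-up families.
example = {
    "family_a": {"dbCAN": 0.95, "HMMER": 0.90, "DIAMOND": 0.97,
                 "Hotpep": 0.70, "CUPP": 0.88, "eCAMI": 0.80},
    "family_b": {"dbCAN": 0.60, "HMMER": 0.72, "DIAMOND": 0.81,
                 "Hotpep": 0.40, "CUPP": 0.55, "eCAMI": 0.74},
}
print(hard_families(example))  # ['family_b']
```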

5.3.1 GH difficult to classify families

5.3.2 GT difficult to classify families

5.3.3 PL difficult to classify families

5.3.4 CE difficult to classify families

5.3.5 AA difficult to classify families

5.3.6 CBM difficult to classify families

5.4 Evaluation of multi-label CAZy family classification performance

CAZy annotates proteins in a domain-wise manner. Consequently, a single protein may be assigned to multiple CAZy families. The ability of a classifier to assign all the correct CAZy family annotations to a given protein is not captured when the CAZy family classification performance is evaluated per CAZy family, independently of all other CAZy families.

The CAZy family multi-label classification performance is represented by the Rand Index (RI) and Adjusted Rand Index (ARI). The RI is a quantitative measure of similarity between two clusterings, calculated by considering all pairs of samples and counting the pairs that are assigned to the same or to different clusters in both the predicted and true clusterings. In this case the two clusterings are the predicted and ground truth CAZy family annotations. The raw RI score is then "adjusted for chance" into the ARI score using the following scheme:

ARI = (RI - Expected_RI) / (max(RI) - Expected_RI)

This produces a score between -1 and 1. A score of 1 is produced if all predicted and known CAZy family annotations are identical, 0 if the clustering is completely random, and -1 if the clustering is systematically incorrect, meaning the number of incorrect classifications of proteins is greater than would be expected from randomly annotating proteins with CAZy families.
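The pair counting and chance adjustment can be sketched for a single protein as follows. The sketch assumes the ground truth and predicted annotations are given as two equal-length label vectors (for example, one 0/1 entry per CAZy family); the function names and the example vectors are hypothetical, and the notebook may compute the scores differently. The ARI is written in the standard pair-count form, which equals (RI - Expected_RI) / (1 - Expected_RI).

```python
from collections import Counter
from itertools import combinations
from math import comb


def rand_index(truth, predicted):
    """Fraction of label pairs on which the two clusterings agree."""
    if len(truth) != len(predicted) or len(truth) < 2:
        raise ValueError("need two equal-length vectors of at least 2 labels")
    pairs = list(combinations(range(len(truth)), 2))
    agree = sum(
        (truth[i] == truth[j]) == (predicted[i] == predicted[j])
        for i, j in pairs
    )
    return agree / len(pairs)


def adjusted_rand_index(truth, predicted):
    """Rand Index adjusted for chance using pair counts."""
    if len(truth) != len(predicted) or len(truth) < 2:
        raise ValueError("need two equal-length vectors of at least 2 labels")
    total = comb(len(truth), 2)
    same_both = sum(comb(c, 2) for c in Counter(zip(truth, predicted)).values())
    same_truth = sum(comb(c, 2) for c in Counter(truth).values())
    same_pred = sum(comb(c, 2) for c in Counter(predicted).values())
    expected = same_truth * same_pred / total
    maximum = (same_truth + same_pred) / 2
    if maximum == expected:
        # Only happens when both clusterings are identical and trivial.
        return 1.0
    return (same_both - expected) / (maximum - expected)


# Hypothetical protein: two true annotations, only one of them predicted.
truth = [1, 0, 0, 0, 0, 1]
predicted = [1, 0, 0, 0, 0, 0]
print(rand_index(truth, predicted))           # 10 of 15 pairs agree
print(adjusted_rand_index(truth, predicted))  # 8/23 after chance adjustment
```

In this example a single missed annotation lowers the ARI far more than the RI, which is consistent with the wider spread of the ARI values in Table 5.9 compared with the RI values in Table 5.8.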

Table 5.8: Rand Index of CAZyme classifier classification of CAZy family annotations
Prediction_tool Mean Standard Deviation Lower CI Upper CI
dbCAN 0.9997 0.0011 0.9997 0.9997
HMMER 0.9996 0.0014 0.9996 0.9996
DIAMOND 0.9998 0.0010 0.9998 0.9998
Hotpep 0.9991 0.0023 0.9991 0.9991
CUPP 0.9995 0.0015 0.9994 0.9995
eCAMI 0.9994 0.0017 0.9994 0.9995
Table 5.9: Adjusted Rand Index of CAZyme classifier classification of CAZy family annotations
Prediction_tool Mean Standard Deviation Lower CI Upper CI
dbCAN 0.9391 0.2359 0.9352 0.9430
HMMER 0.9250 0.2554 0.9208 0.9292
DIAMOND 0.9530 0.2105 0.9495 0.9565
Hotpep 0.8758 0.3083 0.8707 0.8809
CUPP 0.9098 0.2712 0.9053 0.9143
eCAMI 0.9077 0.2778 0.9031 0.9123

Multilabel classification arises when a single instance can be assigned to multiple classes. In this evaluation a single instance is a protein and the classes are CAZy families: a single CAZyme can be assigned to multiple CAZy families. This is important to take into consideration because the approaches used for the statistical evaluation of binary classification provide only a limited view of the performance of the classifiers when applied to multilabel classification.

The plots below are violin plots overlaid with scatter plots of the Rand Index and Adjusted Rand Index for every protein in every test set, excluding true negatives.

Figure 5.13: Violin plot of Rand Index (RI) of performance of the CAZyme classifiers to predict the multilabel classification of CAZy families.

Figure 5.14: 95% confidence interval around the mean of Rand Index (RI) of the performance of the CAZyme classifiers to predict the multilabel classification of CAZy families.

Figure 5.15: Violin plot of Adjusted Rand Index (ARI) of performance of the CAZyme classifiers to predict the multilabel classification of CAZy families.

Figure 5.16: 95% confidence interval around the mean of Adjusted Rand Index (ARI) of the performance of the CAZyme classifiers to predict the multilabel classification of CAZy families

6 Performance per taxonomy group

The performance of a classifier may vary between taxonomy groups. For this evaluation the test sets were separated into two taxonomy groups: Bacteria and Eukaryote.

The performance of each classifier was evaluated per taxonomy group and compared against its performance when all test sets were pooled together.

6.1 Binary classification of CAZymes and non-CAZymes

Here we calculate the mean plus and minus the standard deviation of the F1-score of each prediction tool for each taxonomy group, to represent the overall performance per taxonomy group.
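The summary statistics reported per classifier and taxonomy group (mean, standard deviation and 95% confidence interval bounds) can be sketched as below. The function name `summarise` and the example F1-scores are hypothetical, and the interval uses a normal approximation as a simplifying assumption; the notebook may use a different method, such as a t-distribution.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def summarise(scores, confidence=0.95):
    """Return the mean, sample SD and confidence interval of the mean."""
    if len(scores) < 2:
        raise ValueError("need at least two scores")
    centre = mean(scores)
    spread = stdev(scores)
    # Assumption: normal approximation for the interval of the mean.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * spread / sqrt(len(scores))
    return {
        "mean": centre,
        "sd": spread,
        "ci_lower": centre - half_width,
        "ci_upper": centre + half_width,
    }


# Hypothetical F1-scores, one per test set, for one classifier and group.
f1_scores = [0.90, 0.95, 0.92, 0.97, 0.88]
print(summarise(f1_scores))
```

Running the same function on the bacterial test sets, the eukaryote test sets and all test sets pooled would give the three column groups of a table laid out like Table 6.1.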

Table 6.1: The F1-score of binary CAZyme/non-CAZyme classification by CAZy classifiers per taxonomy group
Prediction_tool Bact Mean Bact Standard Deviation Bact Lower CI Bact Upper CI Euk Mean Euk Standard Deviation Euk Lower CI Euk Upper CI All Mean All Standard Deviation All Lower CI All Upper CI
CUPP 0.9217 0.0522 0.9048 0.9386 0.9103 0.0545 0.8903 0.9303 0.9167 0.0531 0.9040 0.9293
dbCAN 0.9434 0.0782 0.9180 0.9687 0.9385 0.0826 0.9082 0.9688 0.9412 0.0796 0.9222 0.9602
DIAMOND 0.9481 0.0919 0.9183 0.9779 0.9480 0.0908 0.9147 0.9813 0.9481 0.0907 0.9264 0.9697
eCAMI 0.9270 0.0763 0.9023 0.9518 0.8913 0.0960 0.8560 0.9265 0.9112 0.0868 0.8905 0.9319
HMMER 0.9210 0.0791 0.8953 0.9466 0.9425 0.0215 0.9346 0.9503 0.9305 0.0613 0.9159 0.9451
Hotpep 0.8898 0.0774 0.8647 0.9149 0.8817 0.1083 0.8420 0.9214 0.8862 0.0917 0.8643 0.9081

Figure 6.1: 95% confidence interval around the mean F1-score of the binary classification of CAZymes and non-CAZymes per taxonomic group.

6.1.1 Specificity

Figure 6.2: One dimensional scatter plot overlaying a box and whisker plot of the specificity of binary classification per CAZyme/non-CAZyme classifier per taxonomy group. Each point represents the score from one test set.

6.1.2 Sensitivity

Figure 6.3: One dimensional scatter plot overlaying a box and whisker plot of the sensitivity of binary classification per CAZyme/non-CAZyme classifier per taxonomy group. Each point represents the score from one test set.

6.1.3 Precision

Figure 6.4: One dimensional scatter plot overlaying a box and whisker plot of the precision of binary classification per CAZyme/non-CAZyme classifier per taxonomy group. Each point represents the score from one test set.

6.1.4 F1-score

Figure 6.5: One dimensional scatter plot overlaying a box and whisker plot of the F1-score of binary classification per CAZyme/non-CAZyme classifier per taxonomy group. Each point represents the score from one test set.

6.1.5 Accuracy

Figure 6.6: One dimensional scatter plot overlaying a box and whisker plot of the accuracy of binary classification per CAZyme/non-CAZyme classifier per taxonomy group. Each point represents the score from one test set.

6.2 CAZy class classification

The table below presents the mean F1-score plus/minus the standard deviation per CAZyme classifier per taxonomy group, representing the overall performance of each CAZyme classifier per taxonomy group across all CAZy class classifications.

Table 6.2: Overall performance (represented by the F1-score) of CAZy class classification by CAZy classifiers per taxonomy group
Prediction_tool Bact Mean Bact Standard Deviation Bact Lower CI Bact Upper CI Euk Mean Euk Standard Deviation Euk Lower CI Euk Upper CI All Mean All Standard Deviation All Lower CI All Upper CI
CUPP 0.9217 0.0522 0.9048 0.9386 0.7370 0.3863 0.6765 0.7975 0.9170 0.1460 0.9017 0.9323
dbCAN 0.9434 0.0782 0.9180 0.9687 0.9139 0.1394 0.8921 0.9358 0.8675 0.2013 0.8464 0.8886
DIAMOND 0.9481 0.0919 0.9183 0.9779 0.9110 0.1924 0.8808 0.9411 0.9213 0.1725 0.9032 0.9394
eCAMI 0.9270 0.0763 0.9023 0.9518 0.8340 0.2013 0.8025 0.8656 0.8207 0.2116 0.7985 0.8429
HMMER 0.9210 0.0791 0.8953 0.9466 0.8638 0.2186 0.8296 0.8981 0.7343 0.3937 0.6930 0.7756
Hotpep 0.8898 0.0774 0.8647 0.9149 0.8148 0.2185 0.7806 0.8490 0.8487 0.1950 0.8282 0.8691

Figure 6.7: 95% confidence interval around the mean F1-score of the classification of CAZy classes per taxonomic group.

To evaluate the difference between the taxonomic kingdoms per CAZy class, the data was separated into each of the CAZy classes. The F1-score was then plotted as a one-dimensional scatter plot overlaying a boxplot, with data grouped by the taxonomic kingdom and facet wrapped by classifier.

6.2.1 Difference in taxonomic performance for GH classification

Figure 6.8 summarises the difference in performance between bacterial and eukaryotic GH class members.

Overall, the classifiers demonstrated similar performances between the bacterial and eukaryotic test sets. eCAMI showed the greatest difference in performance between bacteria and eukaryotes, demonstrating a more consistent performance against bacterial proteins, as inferred from the smaller interquartile range.

Figure 6.8: One dimensional scatter plot overlaying a box and whisker plot of the F1-score of classifying GH class members for CAZyme classifiers, when parsing data from bacterial, eukaryote or both (identified as ‘all’) kingdoms. One point on the scatter plot represents the F1-score for one test set.

The following tables summarise the performance of each classifier across all test sets for each taxonomic group (bacteria (table ??) and eukaryota (table ??)), and when all test sets are pooled (which is assigned the taxonomic group ‘All’) (table ??).

6.2.1.1 Specificity

Table 6.3: Overall performance of CAZyme classifiers for the classification of bacterial GH class members
Classifier Mean Bacteria Specificity Bacteria Specificity Standard Deviation Bacteria Lower CI Bacteria Upper CI Mean Bacteria Sensitivity Bacteria Sensitivity Standard Deviation Bacteria Lower CI Bacteria Upper CI Mean Bacteria Precision Bacteria Precision Standard Deviation Bacteria Lower CI Bacteria Upper CI Mean Bacteria F1-score Bacteria F1-score Standard Deviation Bacteria Lower CI Bacteria Upper CI Mean Bacteria Accuracy Bacteria Accuracy Standard Deviation Bacteria Lower CI Bacteria Upper CI
CUPP 0.9927 0.0151 0.9878 0.9976 0.9202 0.0786 0.8948 0.9457 0.9938 0.0132 0.9895 0.9981 0.9536 0.0469 0.9384 0.9688 0.9563 0.0323 0.9458 0.9668
dbCAN 0.9942 0.0138 0.9897 0.9986 0.9281 0.1014 0.8952 0.9609 0.9924 0.0233 0.9848 0.9999 0.9555 0.0691 0.9331 0.9779 0.9613 0.0401 0.9483 0.9743
DIAMOND 0.9894 0.0174 0.9838 0.9951 0.9439 0.1112 0.9079 0.9800 0.9878 0.0287 0.9785 0.9971 0.9608 0.0737 0.9369 0.9847 0.9670 0.0468 0.9518 0.9821
eCAMI 0.9813 0.0244 0.9734 0.9892 0.9205 0.1000 0.8881 0.9529 0.9823 0.0256 0.9740 0.9906 0.9469 0.0585 0.9280 0.9659 0.9504 0.0484 0.9347 0.9661
HMMER 0.9933 0.0136 0.9889 0.9977 0.9106 0.1074 0.8758 0.9454 0.9926 0.0169 0.9871 0.9980 0.9457 0.0762 0.9210 0.9704 0.9527 0.0430 0.9388 0.9666
Hotpep 0.9763 0.0274 0.9674 0.9851 0.9070 0.0890 0.8782 0.9359 0.9755 0.0359 0.9638 0.9871 0.9375 0.0561 0.9193 0.9557 0.9425 0.0402 0.9295 0.9556
Table 6.4: Overall performance of CAZyme classifiers for the classification of eukaryote GH class members
Classifier Mean Eukaryote Specificity Eukaryote Specificity Standard Deviation Eukaryote Lower CI Eukaryote Upper CI Mean Eukaryote Sensitivity Eukaryote Sensitivity Standard Deviation Eukaryote Lower CI Eukaryote Upper CI Mean Eukaryote Precision Eukaryote Precision Standard Deviation Eukaryote Lower CI Eukaryote Upper CI Mean Eukaryote F1-score Eukaryote F1-score Standard Deviation Eukaryote Lower CI Eukaryote Upper CI Mean Eukaryote Accuracy Eukaryote Accuracy Standard Deviation Eukaryote Lower CI Eukaryote Upper CI
CUPP 0.9990 0.0038 0.9976 1.0004 0.9053 0.0532 0.8858 0.9248 0.9981 0.0075 0.9953 1.0008 0.9486 0.0302 0.9375 0.9597 0.9603 0.0213 0.9525 0.9681
dbCAN 0.9983 0.0053 0.9964 1.0003 0.9118 0.1032 0.8739 0.9496 0.9976 0.0077 0.9947 1.0004 0.9493 0.0630 0.9262 0.9725 0.9640 0.0362 0.9507 0.9773
DIAMOND 0.9983 0.0053 0.9964 1.0003 0.9458 0.0995 0.9093 0.9823 0.9976 0.0076 0.9948 1.0004 0.9678 0.0634 0.9445 0.9911 0.9773 0.0330 0.9652 0.9893
eCAMI 0.9978 0.0073 0.9952 1.0005 0.8366 0.1063 0.7976 0.8756 0.9968 0.0103 0.9930 1.0005 0.9056 0.0686 0.8805 0.9308 0.9321 0.0376 0.9183 0.9459
HMMER 0.9986 0.0044 0.9970 1.0002 0.9207 0.0362 0.9074 0.9340 0.9967 0.0106 0.9928 1.0005 0.9568 0.0196 0.9496 0.9640 0.9658 0.0161 0.9599 0.9717
Hotpep 0.9967 0.0083 0.9936 0.9997 0.8517 0.1278 0.8049 0.8986 0.9952 0.0116 0.9909 0.9994 0.9122 0.0802 0.8828 0.9416 0.9376 0.0455 0.9209 0.9542
Table 6.5: Overall performance of CAZyme classifiers for the classification of bacterial and eukaryote GH class members
Classifier | All Specificity: Mean, Standard Deviation, Lower CI, Upper CI | All Sensitivity: Mean, Standard Deviation, Lower CI, Upper CI | All Precision: Mean, Standard Deviation, Lower CI, Upper CI | All F1-score: Mean, Standard Deviation, Lower CI, Upper CI | All Accuracy: Mean, Standard Deviation, Lower CI, Upper CI
CUPP 0.9955 0.0119 0.9927 0.9983 0.9136 0.0685 0.8973 0.9300 0.9957 0.0112 0.9930 0.9983 0.9514 0.0402 0.9418 0.9609 0.9581 0.0279 0.9514 0.9647
dbCAN 0.9960 0.0110 0.9934 0.9986 0.9209 0.1017 0.8966 0.9451 0.9947 0.0182 0.9903 0.9990 0.9527 0.0661 0.9370 0.9685 0.9625 0.0382 0.9534 0.9716
DIAMOND 0.9934 0.0141 0.9900 0.9967 0.9447 0.1054 0.9196 0.9699 0.9921 0.0224 0.9868 0.9975 0.9639 0.0689 0.9475 0.9803 0.9715 0.0413 0.9617 0.9814
eCAMI 0.9886 0.0205 0.9837 0.9935 0.8834 0.1104 0.8570 0.9097 0.9887 0.0214 0.9836 0.9938 0.9286 0.0660 0.9129 0.9444 0.9423 0.0446 0.9317 0.9529
HMMER 0.9957 0.0108 0.9931 0.9982 0.9151 0.0834 0.8952 0.9350 0.9944 0.0145 0.9909 0.9978 0.9506 0.0583 0.9367 0.9645 0.9585 0.0342 0.9503 0.9667
Hotpep 0.9853 0.0234 0.9797 0.9909 0.8825 0.1106 0.8562 0.9089 0.9842 0.0294 0.9772 0.9912 0.9263 0.0685 0.9100 0.9426 0.9403 0.0424 0.9302 0.9504

The following plots present the performance for each classifier for each test set for the following performance statistics: specificity (figure 6.9), sensitivity (6.10), precision (6.11), F1-score (6.12), and accuracy (6.13).
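
As a reminder of how these five statistics relate to one another, the sketch below derives them from the confusion counts of a single CAZy class in a single test set. It is illustrative only: the `Confusion` container, the function names and the example counts are assumptions made for this sketch and do not come from the evaluation code. Statistics with a zero denominator are returned as `None` rather than scored.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Counts for one CAZy class in one test set (hypothetical container)."""
    tp: int  # class members predicted as class members
    fp: int  # non-members predicted as class members
    tn: int  # non-members predicted as non-members
    fn: int  # class members predicted as non-members


def _ratio(numerator, denominator):
    # Return None when the statistic is undefined (zero denominator);
    # how such cases are scored is a choice this sketch does not make.
    return numerator / denominator if denominator else None


def performance(c):
    """Return the five performance statistics for one set of confusion counts."""
    return {
        "specificity": _ratio(c.tn, c.tn + c.fp),
        "sensitivity": _ratio(c.tp, c.tp + c.fn),
        "precision": _ratio(c.tp, c.tp + c.fp),
        "f1": _ratio(2 * c.tp, 2 * c.tp + c.fp + c.fn),
        "accuracy": _ratio(c.tp + c.tn, c.tp + c.fp + c.tn + c.fn),
    }


# Made-up counts for a 200-protein test set, not taken from the benchmark.
example = performance(Confusion(tp=45, fp=1, tn=149, fn=5))
```

Because the F1-score depends only on true positives, false positives and false negatives, it is insensitive to the large number of true negatives that keeps specificity and accuracy close to 1 in the tables.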

Figure 6.9: One dimensional scatter plot of the specificity per test set for the classification of GH class members, overlaying a box plot

Figure 6.10: One dimensional scatter plot of the sensitivity per test set for the classification of GH class members, overlaying a box plot

Figure 6.11: One dimensional scatter plot of the precision per test set for the classification of GH class members, overlaying a box plot

Figure 6.12: One dimensional scatter plot of the F1-score per test set for the classification of GH class members, overlaying a box plot

Figure 6.13: One dimensional scatter plot of the accuracy per test set for the classification of GH class members, overlaying a box plot

6.2.2 Difference in taxonomic performance for GT classification

Figure 6.14 plots the difference in performance between bacterial and eukaryotic GT class members. Hotpep demonstrates the greatest difference in performance between bacteria and eukaryotes, with a more consistent performance for eukaryotes, as inferred from the smaller interquartile range. Otherwise, there was no significant difference in performance between the two kingdoms.

Figure 6.14: One dimensional scatter plot overlaying a box and whisker plot of the F1-score of classifying GT class members for CAZyme classifiers, when parsing data from bacterial, eukaryote or both (identified as ‘all’) kingdoms. One point on the scatter plot represents the F1-score for one test set.
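
The comparison of consistency above relies on the interquartile range (IQR) of the per-test-set F1-scores within each kingdom. A minimal sketch of that calculation follows, assuming the scores are available as plain lists. The values, the dictionary layout and the helper name are hypothetical, and the quartile convention (`method="inclusive"`) may differ from the one used to draw the box plots.

```python
from statistics import quantiles


def interquartile_range(scores):
    """IQR (Q3 - Q1) of a list of per-test-set scores."""
    q1, _median, q3 = quantiles(scores, n=4, method="inclusive")
    return q3 - q1


# Made-up F1-scores per test set, grouped by kingdom; not benchmark data.
f1_by_kingdom = {
    "bacteria": [0.55, 0.70, 0.80, 0.90, 0.95],
    "eukaryote": [0.82, 0.85, 0.87, 0.90, 0.92],
}

# A smaller IQR is read as a more consistent performance across test sets.
iqr = {
    kingdom: interquartile_range(scores)
    for kingdom, scores in f1_by_kingdom.items()
}
```

With these invented values the eukaryote IQR is the smaller of the two, which is the pattern the text describes for Hotpep.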

The following tables summarise the performance of each classifier across all test sets for each taxonomic group (bacteria (table 6.6) and eukaryota (table 6.7)), and when all test sets are pooled (which is assigned the taxonomic group ‘All’) (table 6.8).

Table 6.6: Overall performance of CAZyme classifiers for the classification of bacterial GT class members
Classifier | Bacteria Specificity: Mean, Standard Deviation, Lower CI, Upper CI | Bacteria Sensitivity: Mean, Standard Deviation, Lower CI, Upper CI | Bacteria Precision: Mean, Standard Deviation, Lower CI, Upper CI | Bacteria F1-score: Mean, Standard Deviation, Lower CI, Upper CI | Bacteria Accuracy: Mean, Standard Deviation, Lower CI, Upper CI
CUPP 0.9989 0.0051 0.9972 1.0005 0.8767 0.1167 0.8389 0.9146 0.9985 0.0069 0.9963 1.0008 0.9292 0.0753 0.9048 0.9536 0.9584 0.0518 0.9416 0.9752
dbCAN 0.9996 0.0027 0.9987 1.0004 0.8777 0.1416 0.8318 0.9236 0.9994 0.0038 0.9982 1.0006 0.9273 0.0997 0.8950 0.9597 0.9573 0.0666 0.9357 0.9789
DIAMOND 0.9986 0.0063 0.9965 1.0006 0.9324 0.1543 0.8824 0.9825 0.9985 0.0072 0.9962 1.0008 0.9557 0.1105 0.9199 0.9916 0.9738 0.0718 0.9505 0.9971
eCAMI 0.9977 0.0090 0.9948 1.0006 0.8589 0.1683 0.8044 0.9135 0.9976 0.0084 0.9949 1.0004 0.9134 0.1121 0.8771 0.9498 0.9526 0.0647 0.9316 0.9736
HMMER 0.9996 0.0024 0.9988 1.0004 0.8485 0.1274 0.8072 0.8898 0.9993 0.0044 0.9978 1.0007 0.9114 0.0950 0.8806 0.9422 0.9467 0.0675 0.9249 0.9686
Hotpep 0.9985 0.0046 0.9970 1.0000 0.6810 0.1931 0.6184 0.7435 0.9956 0.0139 0.9911 1.0001 0.7922 0.1493 0.7438 0.8406 0.8940 0.0798 0.8682 0.9199
Table 6.7: Overall performance of CAZyme classifiers for the classification of eukaryote GT class members
Classifier | Eukaryote Specificity: Mean, Standard Deviation, Lower CI, Upper CI | Eukaryote Sensitivity: Mean, Standard Deviation, Lower CI, Upper CI | Eukaryote Precision: Mean, Standard Deviation, Lower CI, Upper CI | Eukaryote F1-score: Mean, Standard Deviation, Lower CI, Upper CI | Eukaryote Accuracy: Mean, Standard Deviation, Lower CI, Upper CI
CUPP 0.9971 0.0103 0.9934 1.0009 0.8518 0.1217 0.8071 0.8964 0.9954 0.0155 0.9897 1.0011 0.9129 0.0770 0.8846 0.9411 0.9378 0.0642 0.9143 0.9614
dbCAN 0.9983 0.0093 0.9949 1.0017 0.8889 0.1384 0.8382 0.9397 0.9980 0.0112 0.9939 1.0021 0.9333 0.0980 0.8973 0.9692 0.9520 0.0807 0.9223 0.9816
DIAMOND 0.9965 0.0108 0.9926 1.0005 0.9302 0.1429 0.8777 0.9826 0.9948 0.0160 0.9889 1.0006 0.9540 0.0999 0.9173 0.9906 0.9656 0.0835 0.9350 0.9962
eCAMI 0.9983 0.0093 0.9949 1.0017 0.8454 0.1579 0.7875 0.9033 0.9979 0.0115 0.9937 1.0021 0.9060 0.1112 0.8652 0.9468 0.9280 0.0953 0.8931 0.9630
HMMER 0.9957 0.0139 0.9906 1.0008 0.9076 0.0654 0.8837 0.9316 0.9963 0.0128 0.9916 1.0010 0.9486 0.0367 0.9351 0.9620 0.9640 0.0240 0.9552 0.9727
Hotpep 0.9983 0.0093 0.9949 1.0017 0.7811 0.1705 0.7185 0.8436 0.9977 0.0125 0.9932 1.0023 0.8644 0.1264 0.8181 0.9108 0.9066 0.1021 0.8691 0.9441
Table 6.8: Overall performance of CAZyme classifiers for the classification of bacterial and eukaryote GT class members
Classifier | All Specificity: Mean, Standard Deviation, Lower CI, Upper CI | All Sensitivity: Mean, Standard Deviation, Lower CI, Upper CI | All Precision: Mean, Standard Deviation, Lower CI, Upper CI | All F1-score: Mean, Standard Deviation, Lower CI, Upper CI | All Accuracy: Mean, Standard Deviation, Lower CI, Upper CI
CUPP 0.9981 0.0078 0.9962 1.0000 0.8657 0.1188 0.8374 0.8940 0.9971 0.0115 0.9944 0.9999 0.9220 0.0759 0.9039 0.9401 0.9493 0.0581 0.9354 0.9632
dbCAN 0.9990 0.0065 0.9975 1.0006 0.8827 0.1393 0.8495 0.9159 0.9988 0.0080 0.9969 1.0007 0.9300 0.0983 0.9065 0.9534 0.9549 0.0727 0.9376 0.9722
DIAMOND 0.9977 0.0086 0.9956 0.9997 0.9314 0.1483 0.8961 0.9668 0.9968 0.0120 0.9940 0.9997 0.9550 0.1052 0.9299 0.9800 0.9702 0.0768 0.9519 0.9885
eCAMI 0.9980 0.0090 0.9958 1.0002 0.8529 0.1627 0.8141 0.8917 0.9978 0.0098 0.9954 1.0001 0.9101 0.1109 0.8837 0.9366 0.9417 0.0800 0.9226 0.9608
HMMER 0.9979 0.0096 0.9956 1.0002 0.8747 0.1080 0.8489 0.9005 0.9980 0.0092 0.9958 1.0002 0.9279 0.0768 0.9095 0.9462 0.9544 0.0532 0.9417 0.9671
Hotpep 0.9984 0.0070 0.9967 1.0001 0.7253 0.1889 0.6802 0.7703 0.9966 0.0132 0.9934 0.9997 0.8242 0.1433 0.7900 0.8584 0.8996 0.0899 0.8782 0.9210
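
Several upper confidence limits in these tables are greater than 1 (for example, 1.0006 for the specificity of dbCAN in table 6.8), which is not an attainable value for these statistics. This is what a symmetric interval around a mean close to 1 produces when it is not clipped to the range 0 to 1. The sketch below illustrates the effect with a normal-approximation 95% interval. The method actually used to build the tables is not stated in this section, so the formula, the function name `summarise` and the example scores are all assumptions.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def summarise(scores, confidence=0.95):
    """Mean, sample SD and a normal-approximation CI for per-test-set scores."""
    n = len(scores)
    m = mean(scores)
    sd = stdev(scores)  # sample standard deviation (n - 1 denominator)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    half_width = z * sd / sqrt(n)
    # The interval is not clipped to [0, 1], so the upper bound can pass 1.0
    # when the mean sits close to the ceiling.
    return {"mean": m, "sd": sd, "lower": m - half_width, "upper": m + half_width}


# Made-up specificity values for five test sets, not benchmark data.
summary = summarise([1.0, 1.0, 1.0, 1.0, 0.98])
```

Limits above 1 are therefore best read as "indistinguishable from 1" rather than as a literal bound.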

The following plots present the performance for each classifier for each test set for the following performance statistics: specificity (figure 6.15), sensitivity (6.16), precision (6.17), F1-score (6.18), and accuracy (6.19).

Figure 6.15: One dimensional scatter plot of the specificity per test set for the classification of GT class members, overlaying a box plot

Figure 6.16: One dimensional scatter plot of the sensitivity per test set for the classification of GT class members, overlaying a box plot

Figure 6.17: One dimensional scatter plot of the precision per test set for the classification of GT class members, overlaying a box plot

Figure 6.18: One dimensional scatter plot of the F1-score per test set for the classification of GT class members, overlaying a box plot

Figure 6.19: One dimensional scatter plot of the accuracy per test set for the classification of GT class members, overlaying a box plot

6.2.3 Difference in taxonomic performance for PL classification

Figure 6.20 plots the difference in performance between bacterial and eukaryotic PL class members. Most classifiers showed a strong consistency in performance between the bacterial and eukaryotic test sets (as inferred from the small interquartile ranges), except eCAMI, which showed a significantly greater range in performance when classifying bacterial proteins.

Figure 6.20: One dimensional scatter plot overlaying a box and whisker plot of the F1-score of classifying PL class members for CAZyme classifiers, when parsing data from bacterial, eukaryote or both (identified as ‘all’) kingdoms. One point on the scatter plot represents the F1-score for one test set.

The following tables summarise the performance of each classifier across all test sets for each taxonomic group (bacteria (table 6.9) and eukaryota (table 6.10)), and when all test sets are pooled (which is assigned the taxonomic group ‘All’) (table 6.11).

Table 6.9: Overall performance of CAZyme classifiers for the classification of bacterial PL class members
Classifier | Bacteria Specificity: Mean, Standard Deviation, Lower CI, Upper CI | Bacteria Sensitivity: Mean, Standard Deviation, Lower CI, Upper CI | Bacteria Precision: Mean, Standard Deviation, Lower CI, Upper CI | Bacteria F1-score: Mean, Standard Deviation, Lower CI, Upper CI | Bacteria Accuracy: Mean, Standard Deviation, Lower CI, Upper CI
CUPP 0.9996 0.0020 0.9988 1.0004 0.7233 0.3707 0.5767 0.8700 0.8466 0.3608 0.7038 0.9893 0.7637 0.3562 0.6228 0.9046 0.9900 0.0162 0.9836 0.9964
dbCAN 1.0000 0.0000 1.0000 1.0000 0.8210 0.3033 0.7011 0.9410 0.9259 0.2669 0.8204 1.0315 0.8572 0.2825 0.7454 0.9690 0.9941 0.0089 0.9906 0.9976
DIAMOND 0.9993 0.0027 0.9982 1.0003 0.8792 0.2539 0.7788 0.9796 0.9392 0.2122 0.8552 1.0231 0.8921 0.2297 0.8012 0.9829 0.9952 0.0075 0.9923 0.9982
eCAMI 0.9989 0.0032 0.9977 1.0002 0.7129 0.3258 0.5866 0.8393 0.8798 0.3141 0.7580 1.0015 0.7725 0.3077 0.6532 0.8918 0.9880 0.0173 0.9813 0.9947
HMMER 0.9992 0.0039 0.9977 1.0008 0.8487 0.2827 0.7368 0.9605 0.9074 0.2786 0.7972 1.0176 0.8709 0.2745 0.7623 0.9795 0.9945 0.0089 0.9910 0.9980
Hotpep 0.9988 0.0046 0.9969 1.0006 0.8035 0.2913 0.6883 0.9188 0.9145 0.2681 0.8084 1.0205 0.8439 0.2678 0.7379 0.9498 0.9904 0.0167 0.9838 0.9971
Table 6.10: Overall performance of CAZyme classifiers for the classification of eukaryote PL class members
Classifier | Eukaryote Specificity: Mean, Standard Deviation, Lower CI, Upper CI | Eukaryote Sensitivity: Mean, Standard Deviation, Lower CI, Upper CI | Eukaryote Precision: Mean, Standard Deviation, Lower CI, Upper CI | Eukaryote F1-score: Mean, Standard Deviation, Lower CI, Upper CI | Eukaryote Accuracy: Mean, Standard Deviation, Lower CI, Upper CI
CUPP 0.9982 0.0040 0.9955 1.0009 0.9021 0.3001 0.7005 1.1037 0.8561 0.3075 0.6495 1.0627 0.8743 0.2980 0.6741 1.0745 0.9964 0.0050 0.9931 0.9998
dbCAN 1.0000 0.0000 1.0000 1.0000 0.9557 0.1064 0.8842 1.0272 1.0000 0.0000 1.0000 1.0000 0.9742 0.0630 0.9319 1.0165 0.9974 0.0064 0.9931 1.0016
DIAMOND 1.0000 0.0000 1.0000 1.0000 0.8951 0.3004 0.6933 1.0969 0.9091 0.3015 0.7065 1.1116 0.9015 0.3000 0.6999 1.1031 0.9973 0.0065 0.9929 1.0016
eCAMI 1.0000 0.0000 1.0000 1.0000 0.8610 0.2982 0.6607 1.0614 0.9091 0.3015 0.7065 1.1116 0.8825 0.2966 0.6832 1.0817 0.9955 0.0068 0.9909 1.0001
HMMER 1.0000 0.0000 1.0000 1.0000 0.9860 0.0464 0.9549 1.0172 1.0000 0.0000 1.0000 1.0000 0.9924 0.0251 0.9755 1.0093 0.9982 0.0060 0.9941 1.0022
Hotpep 0.9979 0.0069 0.9933 1.0025 0.8648 0.3056 0.6595 1.0701 0.8951 0.3004 0.6933 1.0969 0.8769 0.2995 0.6757 1.0781 0.9947 0.0120 0.9866 1.0027
Table 6.11: Overall performance of CAZyme classifiers for the classification of bacterial and eukaryote PL class members
Classifier | All Specificity: Mean, Standard Deviation, Lower CI, Upper CI | All Sensitivity: Mean, Standard Deviation, Lower CI, Upper CI | All Precision: Mean, Standard Deviation, Lower CI, Upper CI | All F1-score: Mean, Standard Deviation, Lower CI, Upper CI | All Accuracy: Mean, Standard Deviation, Lower CI, Upper CI
CUPP 0.9992 0.0028 0.9983 1.0001 0.7751 0.3573 0.6576 0.8925 0.8493 0.3421 0.7369 0.9618 0.7957 0.3402 0.6839 0.9075 0.9919 0.0141 0.9872 0.9965
dbCAN 1.0000 0.0000 1.0000 1.0000 0.8600 0.2674 0.7721 0.9479 0.9474 0.2263 0.8730 1.0217 0.8911 0.2451 0.8105 0.9716 0.9950 0.0083 0.9923 0.9978
DIAMOND 0.9995 0.0023 0.9987 1.0002 0.8838 0.2641 0.7970 0.9706 0.9305 0.2375 0.8524 1.0085 0.8948 0.2479 0.8133 0.9763 0.9958 0.0072 0.9935 0.9982
eCAMI 0.9992 0.0028 0.9983 1.0001 0.7547 0.3215 0.6505 0.8589 0.8880 0.3069 0.7886 0.9875 0.8035 0.3049 0.7047 0.9023 0.9901 0.0154 0.9851 0.9951
HMMER 0.9995 0.0033 0.9984 1.0006 0.8884 0.2465 0.8074 0.9694 0.9342 0.2374 0.8562 1.0122 0.9061 0.2372 0.8281 0.9840 0.9955 0.0083 0.9928 0.9983
Hotpep 0.9985 0.0053 0.9968 1.0003 0.8213 0.2927 0.7251 0.9175 0.9089 0.2738 0.8189 0.9989 0.8534 0.2736 0.7635 0.9434 0.9917 0.0155 0.9866 0.9967

The following plots present the performance for each classifier for each test set for the following performance statistics: specificity (figure 6.21), sensitivity (6.22), precision (6.23), F1-score (6.24), and accuracy (6.25).

Figure 6.21: One dimensional scatter plot of the specificity per test set for the classification of PL class members, overlaying a box plot

Figure 6.22: One dimensional scatter plot of the sensitivity per test set for the classification of PL class members, overlaying a box plot

Figure 6.23: One dimensional scatter plot of the precision per test set for the classification of PL class members, overlaying a box plot

Figure 6.24: One dimensional scatter plot of the F1-score per test set for the classification of PL class members, overlaying a box plot

Figure 6.25: One dimensional scatter plot of the accuracy per test set for the classification of PL class members, overlaying a box plot

6.2.4 Difference in taxonomic performance for CE classification

Figure 6.26 plots the difference in performance between bacterial and eukaryotic CE class members. Most classifiers showed a strong consistency in performance between the bacterial and eukaryotic test sets (as inferred from the small interquartile ranges), except eCAMI, which showed a significantly greater range in performance when classifying bacterial proteins.

Figure 6.26: One dimensional scatter plot overlaying a box and whisker plot of the F1-score of classifying CE class members for CAZyme classifiers, when parsing data from bacterial, eukaryote or both (identified as ‘all’) kingdoms. One point on the scatter plot represents the F1-score for one test set.

The following tables summarise the performance of each classifier across all test sets for each taxonomic group (bacteria (table 6.12) and eukaryota (table 6.13)), and when all test sets are pooled (which is assigned the taxonomic group ‘All’) (table 6.14).

Table 6.12: Overall performance of CAZyme classifiers for the classification of bacterial CE class members
Classifier | Bacteria Specificity: Mean, Standard Deviation, Lower CI, Upper CI | Bacteria Sensitivity: Mean, Standard Deviation, Lower CI, Upper CI | Bacteria Precision: Mean, Standard Deviation, Lower CI, Upper CI | Bacteria F1-score: Mean, Standard Deviation, Lower CI, Upper CI | Bacteria Accuracy: Mean, Standard Deviation, Lower CI, Upper CI
CUPP 0.9963 0.0108 0.9927 0.9998 0.9565 0.0721 0.9331 0.9798 0.9623 0.1048 0.9283 0.9963 0.9549 0.0734 0.9311 0.9787 0.9932 0.0114 0.9895 0.9969
dbCAN 0.9929 0.0207 0.9862 0.9996 0.9841 0.0681 0.9620 1.0062 0.9432 0.1501 0.8945 0.9918 0.9536 0.1082 0.9185 0.9887 0.9924 0.0195 0.9860 0.9987
DIAMOND 0.9927 0.0215 0.9857 0.9997 0.9279 0.1857 0.8678 0.9881 0.9416 0.1551 0.8913 0.9918 0.9122 0.1679 0.8578 0.9667 0.9894 0.0224 0.9821 0.9966
eCAMI 0.9902 0.0209 0.9835 0.9970 0.9172 0.1432 0.8708 0.9637 0.9070 0.1585 0.8557 0.9584 0.8980 0.1371 0.8535 0.9424 0.9858 0.0212 0.9789 0.9927
HMMER 0.9965 0.0103 0.9932 0.9998 0.9180 0.1368 0.8736 0.9623 0.9661 0.0922 0.9362 0.9960 0.9314 0.0936 0.9011 0.9618 0.9924 0.0106 0.9890 0.9959
Hotpep 0.9885 0.0215 0.9815 0.9954 0.9763 0.0717 0.9530 0.9995 0.8965 0.1683 0.8419 0.9511 0.9251 0.1223 0.8855 0.9648 0.9872 0.0218 0.9802 0.9943
Table 6.13: Overall performance of CAZyme classifiers for the classification of eukaryote CE class members
Classifier | Eukaryote Specificity: Mean, Standard Deviation, Lower CI, Upper CI | Eukaryote Sensitivity: Mean, Standard Deviation, Lower CI, Upper CI | Eukaryote Precision: Mean, Standard Deviation, Lower CI, Upper CI | Eukaryote F1-score: Mean, Standard Deviation, Lower CI, Upper CI | Eukaryote Accuracy: Mean, Standard Deviation, Lower CI, Upper CI
CUPP 0.9996 0.0019 0.9989 1.0004 0.9055 0.2143 0.8224 0.9886 0.9583 0.1904 0.8845 1.0322 0.9262 0.1966 0.8500 1.0025 0.9965 0.0055 0.9943 0.9986
dbCAN 1.0000 0.0000 1.0000 1.0000 0.9375 0.2111 0.8556 1.0194 0.9643 0.1890 0.8910 1.0376 0.9473 0.1975 0.8707 1.0239 0.9982 0.0055 0.9961 1.0003
DIAMOND 1.0000 0.0000 1.0000 1.0000 0.9026 0.2673 0.7990 1.0063 0.9286 0.2623 0.8269 1.0303 0.9136 0.2623 0.8119 1.0153 0.9968 0.0086 0.9935 1.0002
eCAMI 0.9996 0.0020 0.9988 1.0004 0.7314 0.3484 0.5963 0.8666 0.8884 0.3143 0.7665 1.0103 0.7808 0.3228 0.6556 0.9060 0.9922 0.0099 0.9884 0.9961
HMMER 0.9993 0.0027 0.9982 1.0003 0.9929 0.0378 0.9782 1.0075 0.9869 0.0483 0.9682 1.0056 0.9888 0.0330 0.9760 1.0016 0.9989 0.0031 0.9977 1.0001
Hotpep 1.0000 0.0000 1.0000 1.0000 0.7806 0.3181 0.6572 0.9039 0.8929 0.3150 0.7707 1.0150 0.8247 0.3081 0.7052 0.9441 0.9930 0.0084 0.9897 0.9962
Table 6.14: Overall performance of CAZyme classifiers for the classification of bacterial and eukaryote CE class members
Classifier | All Specificity: Mean, Standard Deviation, Lower CI, Upper CI | All Sensitivity: Mean, Standard Deviation, Lower CI, Upper CI | All Precision: Mean, Standard Deviation, Lower CI, Upper CI | All F1-score: Mean, Standard Deviation, Lower CI, Upper CI | All Accuracy: Mean, Standard Deviation, Lower CI, Upper CI
CUPP 0.9977 0.0085 0.9956 0.9997 0.9352 0.1498 0.8986 0.9717 0.9606 0.1455 0.9252 0.9961 0.9429 0.1383 0.9092 0.9766 0.9946 0.0095 0.9923 0.9969
dbCAN 0.9959 0.0161 0.9920 0.9998 0.9646 0.1464 0.9289 1.0003 0.9520 0.1664 0.9114 0.9926 0.9510 0.1507 0.9142 0.9877 0.9948 0.0155 0.9910 0.9986
DIAMOND 0.9958 0.0167 0.9917 0.9998 0.9174 0.2219 0.8632 0.9715 0.9361 0.2050 0.8861 0.9861 0.9128 0.2107 0.8614 0.9642 0.9925 0.0182 0.9880 0.9969
eCAMI 0.9941 0.0166 0.9901 0.9982 0.8396 0.2646 0.7751 0.9041 0.8992 0.2344 0.8421 0.9564 0.8490 0.2384 0.7909 0.9072 0.9885 0.0176 0.9842 0.9928
HMMER 0.9977 0.0081 0.9957 0.9996 0.9493 0.1129 0.9217 0.9768 0.9748 0.0772 0.9560 0.9936 0.9554 0.0794 0.9360 0.9748 0.9952 0.0089 0.9930 0.9973
Hotpep 0.9933 0.0173 0.9891 0.9975 0.8945 0.2320 0.8379 0.9511 0.8950 0.2385 0.8368 0.9532 0.8832 0.2235 0.8286 0.9377 0.9896 0.0176 0.9853 0.9939

The following plots present the performance for each classifier for each test set for the following performance statistics: specificity (figure 6.27), sensitivity (6.28), precision (6.29), F1-score (6.30), and accuracy (6.31).

Figure 6.27: One dimensional scatter plot of the specificity per test set for the classification of CE class members, overlaying a box plot

Figure 6.28: One dimensional scatter plot of the sensitivity per test set for the classification of CE class members, overlaying a box plot

Figure 6.29: One dimensional scatter plot of the precision per test set for the classification of CE class members, overlaying a box plot

Figure 6.30: One dimensional scatter plot of the F1-score per test set for the classification of CE class members, overlaying a box plot

Figure 6.31: One dimensional scatter plot of the accuracy per test set for the classification of CE class members, overlaying a box plot

6.2.5 Difference in taxonomic performance for AA classification

Figure 6.32 plots the difference in performance between bacterial and eukaryotic AA class members. As inferred from comparing the interquartile ranges, all classifiers demonstrate a more consistent performance against bacterial than eukaryotic AA class members. However, this is most likely due to the AA class predominantly containing eukaryotic proteins. It is therefore relatively ‘easier’ for a classifier to determine that a bacterial protein does not belong to the class, because there is low sequence similarity between bacterial proteins and the representative models of the AA class, which over-represents eukaryotic proteins. Additionally, with fewer bacterial AA proteins there are fewer opportunities for a classifier to misclassify an AA member as a non-AA member, resulting in a consistently higher F1-score than for eukaryotes, for which there are many opportunities to misclassify AA members as non-AA members.

Figure 6.32: One dimensional scatter plot overlaying a box and whisker plot of the F1-score of classifying AA class members for CAZyme classifiers, when parsing data from bacterial, eukaryote or both (identified as ‘all’) kingdoms. One point on the scatter plot represents the F1-score for one test set.

The following tables summarise the performance of each classifier across all test sets for each taxonomic group (bacteria (table 6.15) and eukaryota (table 6.16)), and when all test sets are pooled (which is assigned the taxonomic group ‘All’) (table 6.17).

Table 6.15: Overall performance of CAZyme classifiers for the classification of bacterial AA class members
Classifier | Bacteria Specificity: Mean, Standard Deviation, Lower CI, Upper CI | Bacteria Sensitivity: Mean, Standard Deviation, Lower CI, Upper CI | Bacteria Precision: Mean, Standard Deviation, Lower CI, Upper CI | Bacteria F1-score: Mean, Standard Deviation, Lower CI, Upper CI | Bacteria Accuracy: Mean, Standard Deviation, Lower CI, Upper CI
CUPP 1 0 1 1 1 0 1 1 1 0 1 1 1 0 1 1 1 0 1 1
dbCAN 1 0 1 1 1 0 1 1 1 0 1 1 1 0 1 1 1 0 1 1
DIAMOND 1 0 1 1 1 0 1 1 1 0 1 1 1 0 1 1 1 0 1 1
eCAMI 1 0 1 1 1 0 1 1 1 0 1 1 1 0 1 1 1 0 1 1
HMMER 1 0 1 1 1 0 1 1 1 0 1 1 1 0 1 1 1 0 1 1
Hotpep 1 0 1 1 1 0 1 1 1 0 1 1 1 0 1 1 1 0 1 1
Table 6.16: Overall performance of CAZyme classifiers for the classification of eukaryote AA class members
Classifier | Eukaryote Specificity: Mean, Standard Deviation, Lower CI, Upper CI | Eukaryote Sensitivity: Mean, Standard Deviation, Lower CI, Upper CI | Eukaryote Precision: Mean, Standard Deviation, Lower CI, Upper CI | Eukaryote F1-score: Mean, Standard Deviation, Lower CI, Upper CI | Eukaryote Accuracy: Mean, Standard Deviation, Lower CI, Upper CI
CUPP 0.9905 0.0214 0.9820 0.9989 0.8855 0.1292 0.8344 0.9366 0.9154 0.1703 0.8480 0.9828 0.8861 0.1310 0.8343 0.9380 0.9811 0.0274 0.9702 0.9919
dbCAN 0.9905 0.0226 0.9815 0.9994 0.9139 0.1271 0.8637 0.9642 0.9164 0.1699 0.8492 0.9836 0.9033 0.1368 0.8492 0.9574 0.9844 0.0283 0.9732 0.9956
DIAMOND 0.9904 0.0222 0.9816 0.9992 0.8351 0.2779 0.7251 0.9450 0.8826 0.2391 0.7880 0.9771 0.8278 0.2507 0.7286 0.9270 0.9825 0.0248 0.9727 0.9923
eCAMI 0.9912 0.0193 0.9836 0.9988 0.7837 0.1955 0.7064 0.8611 0.9142 0.1712 0.8465 0.9819 0.8190 0.1560 0.7573 0.8807 0.9751 0.0293 0.9635 0.9867
HMMER 0.9897 0.0213 0.9812 0.9981 0.9550 0.0756 0.9251 0.9849 0.9102 0.1654 0.8448 0.9756 0.9217 0.1183 0.8749 0.9685 0.9850 0.0244 0.9754 0.9947
Hotpep 0.9901 0.0230 0.9810 0.9992 0.8938 0.1446 0.8365 0.9510 0.9137 0.1749 0.8445 0.9829 0.8890 0.1426 0.8326 0.9454 0.9826 0.0287 0.9712 0.9939
Table 6.17: Overall performance of CAZyme classifiers for the classification of bacterial and eukaryote AA class members
Classifier | All Specificity: Mean, Standard Deviation, Lower CI, Upper CI | All Sensitivity: Mean, Standard Deviation, Lower CI, Upper CI | All Precision: Mean, Standard Deviation, Lower CI, Upper CI | All F1-score: Mean, Standard Deviation, Lower CI, Upper CI | All Accuracy: Mean, Standard Deviation, Lower CI, Upper CI
CUPP 0.9930 0.0187 0.9868 0.9993 0.9165 0.1213 0.8760 0.9569 0.9383 0.1497 0.8884 0.9882 0.9169 0.1226 0.8760 0.9578 0.9862 0.0248 0.9779 0.9945
dbCAN 0.9930 0.0196 0.9865 0.9996 0.9372 0.1147 0.8989 0.9754 0.9390 0.1492 0.8892 0.9887 0.9294 0.1241 0.8881 0.9708 0.9886 0.0251 0.9803 0.9970
DIAMOND 0.9930 0.0194 0.9866 0.9995 0.8796 0.2475 0.7971 0.9622 0.9143 0.2099 0.8443 0.9843 0.8743 0.2267 0.7987 0.9499 0.9872 0.0225 0.9797 0.9947
eCAMI 0.9936 0.0169 0.9880 0.9992 0.8422 0.1926 0.7780 0.9064 0.9374 0.1505 0.8872 0.9876 0.8679 0.1556 0.8160 0.9198 0.9818 0.0273 0.9727 0.9909
HMMER 0.9925 0.0187 0.9862 0.9987 0.9671 0.0673 0.9447 0.9896 0.9345 0.1462 0.8857 0.9832 0.9429 0.1066 0.9073 0.9784 0.9891 0.0218 0.9818 0.9963
Hotpep 0.9928 0.0201 0.9861 0.9995 0.9225 0.1319 0.8785 0.9664 0.9370 0.1536 0.8858 0.9883 0.9190 0.1311 0.8753 0.9627 0.9873 0.0256 0.9788 0.9958

The following plots present the performance for each classifier for each test set for the following performance statistics: specificity (figure 6.33), sensitivity (6.34), precision (6.35), F1-score (6.36), and accuracy (6.37).

Figure 6.33: One dimensional scatter plot of the specificity per test set for the classification of AA class members, overlaying a box plot

Figure 6.34: One dimensional scatter plot of the sensitivity per test set for the classification of AA class members, overlaying a box plot

Figure 6.35: One dimensional scatter plot of the precision per test set for the classification of AA class members, overlaying a box plot

Figure 6.36: One dimensional scatter plot of the F1-score per test set for the classification of AA class members, overlaying a box plot

Figure 6.37: One dimensional scatter plot of the accuracy per test set for the classification of AA class members, overlaying a box plot

6.2.6 Difference in taxonomic performance for CBM classification

Figure 6.38 plots the difference in performance between bacterial and eukaryotic CBM class members. Most classifiers demonstrated a greater variation in performance against eukaryotic than bacterial proteins, which may be the result of greater sequence diversity within the eukaryotic CBMs than the bacterial CBMs.

Figure 6.38: One dimensional scatter plot overlaying a box and whisker plot of the F1-score of classifying CBM class members for CAZyme classifiers, when parsing data from bacterial, eukaryote or both (identified as ‘all’) kingdoms. One point on the scatter plot represents the F1-score for one test set.

The following tables summarise the performance of each classifier across all test sets for each taxonomic group (bacteria (table 6.18) and eukaryota (table 6.19)), and when all test sets are pooled (which is assigned the taxonomic group ‘All’) (table 6.20).

Table 6.18: Overall performance of CAZyme classifiers for the classification of bacterial CBM class members
Classifier Bact_Spec_Mean Bact_Spec_SD Bact_Spec_Lower_CI Bact_Spec_Upper_CI Bact_Sens_Mean Bact_Sens_SD Bact_Sens_Lower_CI Bact_Sens_Upper_CI Bact_Prec_Mean Bact_Prec_SD Bact_Prec_Lower_CI Bact_Prec_Upper_CI Bact_F1_Mean Bact_F1_SD Bact_F1_Lower_CI Bact_F1_Upper_CI Bact_Acc_Mean Bact_Acc_SD Bact_Acc_Lower_CI Bact_Acc_Upper_CI
CUPP 1.0000 0.0000 1.0000 1.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.8684 0.1136 0.8316 0.9052
dbCAN 0.9937 0.0112 0.9901 0.9974 0.8371 0.1957 0.7736 0.9005 0.9276 0.1309 0.8851 0.9700 0.8643 0.1577 0.8132 0.9155 0.9754 0.0288 0.9661 0.9847
DIAMOND 0.9949 0.0117 0.9912 0.9987 0.8664 0.2074 0.7991 0.9336 0.9511 0.1653 0.8975 1.0047 0.8986 0.1810 0.8399 0.9573 0.9811 0.0256 0.9728 0.9894
eCAMI 0.9325 0.0594 0.9132 0.9518 0.8460 0.2300 0.7715 0.9206 0.6447 0.1787 0.5867 0.7026 0.7118 0.1856 0.6516 0.7719 0.9253 0.0579 0.9066 0.9441
HMMER 0.9948 0.0097 0.9917 0.9980 0.5664 0.2516 0.4849 0.6480 0.9243 0.1432 0.8779 0.9707 0.6602 0.1984 0.5958 0.7245 0.9468 0.0327 0.9362 0.9574
Hotpep 0.8869 0.0638 0.8662 0.9076 0.8210 0.2358 0.7445 0.8974 0.4834 0.1837 0.4239 0.5430 0.5902 0.1900 0.5286 0.6518 0.8812 0.0629 0.8608 0.9016
Table 6.19: Overall performance of CAZyme classifiers for the classification of eukaryote CBM class members
Classifier Euk_Spec_Mean Euk_Spec_SD Euk_Spec_Lower_CI Euk_Spec_Upper_CI Euk_Sens_Mean Euk_Sens_SD Euk_Sens_Lower_CI Euk_Sens_Upper_CI Euk_Prec_Mean Euk_Prec_SD Euk_Prec_Lower_CI Euk_Prec_Upper_CI Euk_F1_Mean Euk_F1_SD Euk_F1_Lower_CI Euk_F1_Upper_CI Euk_Acc_Mean Euk_Acc_SD Euk_Acc_Lower_CI Euk_Acc_Upper_CI
CUPP 1.0000 0.0000 1.0000 1.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.9063 0.0369 0.8928 0.9199
dbCAN 0.9937 0.0093 0.9903 0.9971 0.7549 0.1932 0.6840 0.8258 0.9227 0.1175 0.8796 0.9658 0.8169 0.1492 0.7621 0.8716 0.9698 0.0253 0.9605 0.9790
DIAMOND 0.9944 0.0079 0.9915 0.9973 0.8653 0.2010 0.7916 0.9390 0.9327 0.0997 0.8961 0.9692 0.8847 0.1498 0.8297 0.9396 0.9831 0.0208 0.9755 0.9908
eCAMI 0.9679 0.0292 0.9572 0.9786 0.7682 0.2024 0.6939 0.8424 0.7330 0.1893 0.6636 0.8025 0.7344 0.1657 0.6736 0.7952 0.9481 0.0385 0.9339 0.9622
HMMER 0.9975 0.0055 0.9955 0.9996 0.3410 0.1684 0.2792 0.4027 0.8849 0.2732 0.7847 0.9852 0.4773 0.2055 0.4019 0.5527 0.9360 0.0334 0.9237 0.9482
Hotpep 0.9194 0.0398 0.9048 0.9340 0.7399 0.1994 0.6668 0.8131 0.4898 0.1365 0.4397 0.5399 0.5723 0.1282 0.5253 0.6193 0.9008 0.0446 0.8844 0.9171
Table 6.20: Overall performance of CAZyme classifiers for the classification of bacterial and eukaryote CBM class members
Classifier All_Spec_Mean All_Spec_SD All_Spec_Lower_CI All_Spec_Upper_CI All_Sens_Mean All_Sens_SD All_Sens_Lower_CI All_Sens_Upper_CI All_Prec_Mean All_Prec_SD All_Prec_Lower_CI All_Prec_Upper_CI All_F1_Mean All_F1_SD All_F1_Lower_CI All_F1_Upper_CI All_Acc_Mean All_Acc_SD All_Acc_Lower_CI All_Acc_Upper_CI
CUPP 1.0000 0.0000 1.0000 1.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.8852 0.0898 0.8638 0.9066
dbCAN 0.9937 0.0103 0.9912 0.9962 0.8007 0.1975 0.7536 0.8478 0.9254 0.1243 0.8958 0.9551 0.8433 0.1547 0.8064 0.8802 0.9729 0.0272 0.9664 0.9794
DIAMOND 0.9947 0.0101 0.9923 0.9971 0.8659 0.2031 0.8175 0.9143 0.9429 0.1395 0.9097 0.9762 0.8924 0.1669 0.8526 0.9322 0.9820 0.0235 0.9764 0.9876
eCAMI 0.9482 0.0513 0.9359 0.9604 0.8116 0.2202 0.7591 0.8641 0.6838 0.1874 0.6391 0.7285 0.7218 0.1762 0.6798 0.7638 0.9354 0.0512 0.9232 0.9476
HMMER 0.9960 0.0082 0.9941 0.9980 0.4666 0.2448 0.4082 0.5250 0.9069 0.2101 0.8568 0.9570 0.5792 0.2200 0.5267 0.6316 0.9420 0.0332 0.9341 0.9499
Hotpep 0.9013 0.0565 0.8878 0.9148 0.7851 0.2226 0.7320 0.8381 0.4862 0.1634 0.4473 0.5252 0.5823 0.1646 0.5430 0.6215 0.8898 0.0560 0.8765 0.9032

The following plots present the performance for each classifier for each test set for the following performance statistics: specificity (figure 6.39), sensitivity (6.40), precision (6.41), F1-score (6.42), and accuracy (6.43).

Figure 6.39: One dimensional scatter plot of the specificity per test set for the classification of CBM class members, overlaying a box plot

Figure 6.40: One dimensional scatter plot of the sensitivity per test set for the classification of CBM class members, overlaying a box plot

Figure 6.41: One dimensional scatter plot of the precision per test set for the classification of CBM class members, overlaying a box plot

Figure 6.42: One dimensional scatter plot of the F1-score per test set for the classification of CBM class members, overlaying a box plot

Figure 6.43: One dimensional scatter plot of the accuracy per test set for the classification of CBM class members, overlaying a box plot

6.2.7 Multilabel classification of CAZy classes

To represent the overall CAZy class classification performance, and to take the multilabel classification of CAZy classes into consideration, the Rand Index was calculated for each taxonomic group per CAZyme classifier.

Table 6.21: Overall performance of CAZy class classification (represented by the Rand Index) by CAZy classifiers per taxonomy group
Prediction_tool Bact Mean Bact Standard Deviation Bact Lower CI Bact Upper CI Euk Mean Euk Standard Deviation Euk Lower CI Euk Upper CI All Mean All Standard Deviation All Lower CI All Upper CI
CUPP 0.9615 0.1074 0.9591 0.9639 0.9637 0.1044 0.9611 0.9663 0.9625 0.1061 0.9611 0.9663
dbCAN 0.9802 0.0794 0.9785 0.9820 0.9781 0.0832 0.9760 0.9801 0.9793 0.0811 0.9760 0.9801
DIAMOND 0.9845 0.0711 0.9830 0.9861 0.9844 0.0710 0.9826 0.9862 0.9845 0.0711 0.9826 0.9862
eCAMI 0.9674 0.1008 0.9652 0.9697 0.9630 0.1064 0.9604 0.9657 0.9655 0.1034 0.9604 0.9657
HMMER 0.9725 0.0926 0.9704 0.9745 0.9750 0.0884 0.9728 0.9772 0.9736 0.0908 0.9728 0.9772
Hotpep 0.9495 0.1217 0.9468 0.9522 0.9533 0.1170 0.9504 0.9562 0.9512 0.1197 0.9504 0.9562

The Adjusted Rand Index was also calculated to account for the agreement expected by chance.

Table 6.22: Overall performance of CAZy class classification (represented by the Adjusted Rand Index) by CAZy classifiers per taxonomy group
Prediction_tool Bact Mean Bact Standard Deviation Bact Lower CI Bact Upper CI Euk Mean Euk Standard Deviation Euk Lower CI Euk Upper CI All Mean All Standard Deviation All Lower CI All Upper CI
CUPP 0.9004 0.2829 0.8941 0.9066 0.9011 0.2880 0.8939 0.9083 0.9007 0.2852 0.8939 0.9083
dbCAN 0.9427 0.2304 0.9376 0.9478 0.9361 0.2426 0.9301 0.9422 0.9398 0.2359 0.9301 0.9422
DIAMOND 0.9546 0.2078 0.9500 0.9592 0.9543 0.2080 0.9491 0.9595 0.9545 0.2079 0.9491 0.9595
eCAMI 0.9140 0.2691 0.9081 0.9200 0.8958 0.3006 0.8884 0.9033 0.9060 0.2836 0.8884 0.9033
HMMER 0.9225 0.2622 0.9167 0.9284 0.9322 0.2425 0.9262 0.9383 0.9268 0.2537 0.9262 0.9383
Hotpep 0.8681 0.3222 0.8609 0.8752 0.8739 0.3201 0.8659 0.8818 0.8706 0.3212 0.8659 0.8818

6.3 CAZy family classification

Table 6.23: Rand Index of CAZyme classifier classification of CAZy family annotations per taxonomic group
Prediction_tool Bact Mean Bact Standard Deviation Bact Lower CI Bact Upper CI Euk Mean Euk Standard Deviation Euk Lower CI Euk Upper CI All Mean All Standard Deviation All Lower CI All Upper CI
CUPP 0.9994 0.0016 0.9994 0.9995 0.9995 0.0015 0.9994 0.9995 0.9995 0.0015 0.9994 0.9995
dbCAN 0.9997 0.0011 0.9997 0.9997 0.9997 0.0012 0.9997 0.9997 0.9997 0.0011 0.9997 0.9997
DIAMOND 0.9998 0.0010 0.9997 0.9998 0.9998 0.0010 0.9997 0.9998 0.9998 0.0010 0.9998 0.9998
eCAMI 0.9994 0.0018 0.9994 0.9994 0.9995 0.0015 0.9994 0.9995 0.9994 0.0017 0.9994 0.9995
HMMER 0.9996 0.0014 0.9995 0.9996 0.9996 0.0014 0.9996 0.9996 0.9996 0.0014 0.9996 0.9996
Hotpep 0.9990 0.0025 0.9989 0.9990 0.9993 0.0019 0.9992 0.9993 0.9991 0.0023 0.9991 0.9991
Table 6.24: Adjusted Rand Index of CAZyme classifier classification of CAZy family annotations per taxonomic group
Prediction_tool Bact Mean Bact Standard Deviation Bact Lower CI Bact Upper CI Euk Mean Euk Standard Deviation Euk Lower CI Euk Upper CI All Mean All Standard Deviation All Lower CI All Upper CI
CUPP 0.9118 0.2654 0.9059 0.9177 0.9073 0.2782 0.9003 0.9142 0.9098 0.2712 0.9053 0.9143
dbCAN 0.9420 0.2307 0.9369 0.9472 0.9354 0.2422 0.9294 0.9414 0.9391 0.2359 0.9352 0.9430
DIAMOND 0.9529 0.2104 0.9482 0.9576 0.9531 0.2105 0.9478 0.9583 0.9530 0.2105 0.9495 0.9565
eCAMI 0.9148 0.2621 0.9090 0.9207 0.8988 0.2961 0.8914 0.9062 0.9077 0.2778 0.9031 0.9123
HMMER 0.9201 0.2647 0.9142 0.9260 0.9311 0.2430 0.9251 0.9372 0.9250 0.2554 0.9208 0.9292
Hotpep 0.8715 0.3087 0.8647 0.8784 0.8812 0.3077 0.8735 0.8889 0.8758 0.3083 0.8707 0.8809

7 Evaluation of (re)combining tools

Classifiers are often not used in isolation. Frequently, classifiers are combined to produce an overall more accurate classifier. An example of this is dbCAN, which contains the classifiers HMMER, Hotpep and DIAMOND; the consensus classifications of these classifiers are taken as the output of dbCAN.

Defining new combinations of classifiers may reveal a combination that is more accurate than existing combinations and/or using the tools in isolation.

The following combinations of tools were evaluated:
- HMMER, DIAMOND and CUPP
- HMMER, DIAMOND and eCAMI
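The consensus rule described for dbCAN (a prediction that at least two of the tools agree upon) can be sketched for any three-tool combination as follows. This is a minimal Python illustration, not the notebook's own code: the function names `consensus` and `is_cazyme`, the `min_votes` parameter and the example annotations are all hypothetical, and it assumes each tool reports a set of CAZy family annotations per protein, with an empty set meaning a non-CAZyme prediction.

```python
from collections import Counter


def consensus(predictions, min_votes=2):
    """Return the annotations predicted by at least `min_votes` tools.

    `predictions` maps a tool name to the set of CAZy family annotations
    that tool assigned to one protein (an empty set means non-CAZyme).
    """
    votes = Counter()
    for families in predictions.values():
        votes.update(set(families))  # one vote per tool per family
    return {family for family, n in votes.items() if n >= min_votes}


def is_cazyme(predictions, min_votes=2):
    """Binary consensus: CAZyme if enough tools predict any family at all."""
    positive_calls = sum(1 for families in predictions.values() if families)
    return positive_calls >= min_votes


# Hypothetical predictions for a single protein (illustrative values only)
example = {
    "HMMER": {"GH5"},
    "DIAMOND": {"GH5", "CBM1"},
    "CUPP": {"CBM1", "GT2"},
}
print(sorted(consensus(example)))
print(is_cazyme(example))
```

Swapping the `"CUPP"` entry for an `"eCAMI"` entry gives the second combination; the rule itself does not depend on which tools are supplied.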

7.1 Binary classification

Table 7.1 contains the summary statistics for the binary classification of proteins for the individual and combined classifiers.

Table 7.1: Overall performance of CAZyme classifiers differentiation between CAZymes and non-CAZymes
Classifier Spec Mean Spec Standard Deviation Spec Lower CI Spec Upper CI Sens Mean Sens Standard Deviation Sens Lower CI Sens Upper CI Prec Mean Prec Standard Deviation Prec Lower CI Prec Upper CI F1-score Mean F1-score Standard Deviation F1-score Lower CI F1-score Upper CI Acc Mean Acc Standard Deviation Acc Lower CI Acc Upper CI
CUPP 0.9917 0.0155 0.9891 0.9943 0.8570 0.0822 0.8433 0.8707 0.9908 0.0172 0.9879 0.9936 0.9167 0.0529 0.9078 0.9255 0.9244 0.0416 0.9174 0.9313
dbCAN 0.9869 0.0244 0.9828 0.9909 0.9087 0.1119 0.8900 0.9274 0.9866 0.0240 0.9826 0.9906 0.9412 0.0793 0.9280 0.9545 0.9478 0.0562 0.9384 0.9572
DIAMOND 0.9844 0.0262 0.9800 0.9888 0.9261 0.1293 0.9045 0.9478 0.9847 0.0251 0.9805 0.9889 0.9481 0.0904 0.9329 0.9632 0.9553 0.0639 0.9446 0.9660
eCAMI 0.9836 0.0256 0.9793 0.9879 0.8610 0.1323 0.8389 0.8831 0.9826 0.0253 0.9784 0.9868 0.9112 0.0865 0.8967 0.9256 0.9223 0.0644 0.9115 0.9331
HMMER 0.9901 0.0162 0.9874 0.9929 0.8831 0.0832 0.8692 0.8970 0.9893 0.0174 0.9864 0.9922 0.9305 0.0611 0.9203 0.9407 0.9366 0.0421 0.9296 0.9437
HMMER_DIAMOND_CUPP 0.9837 0.0285 0.9790 0.9885 0.9137 0.0285 0.9090 0.9185 0.9825 0.0306 0.9774 0.9876 0.9469 0.0295 0.9419 0.9518 0.9487 0.0285 0.9440 0.9535
HMMER_DIAMOND_eCAMI 0.9806 0.0323 0.9752 0.9860 0.9406 0.0323 0.9352 0.9460 0.9798 0.0336 0.9741 0.9854 0.9598 0.0329 0.9543 0.9653 0.9606 0.0323 0.9552 0.9660
Hotpep 0.9840 0.0256 0.9797 0.9883 0.8189 0.1322 0.7968 0.8410 0.9815 0.0286 0.9767 0.9863 0.8862 0.0914 0.8709 0.9015 0.9014 0.0664 0.8903 0.9125
Figure 7.1: Summary statistics of CAZyme classifiers performances of binary CAZyme/non-CAZyme prediction. The mean plus and minus the 95% confidence interval.
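The summary tables report a mean, standard deviation and 95% confidence interval per statistic. The source does not state how the interval was computed, so the following Python sketch assumes a normal approximation (mean plus and minus 1.96 standard errors); the function name `mean_ci` and the example scores are hypothetical.

```python
import math
import statistics


def mean_ci(values, z=1.96):
    """Return (mean, sd, lower, upper) using mean +/- z * sd / sqrt(n).

    Assumption: a normal-approximation interval; the notebook may have
    used a different method (for example a t-distribution).
    """
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1)
    half_width = z * sd / math.sqrt(n)
    return mean, sd, mean - half_width, mean + half_width


# Hypothetical per-test-set scores for one classifier
scores = [0.90, 1.00, 0.80, 1.00]
mean, sd, lower, upper = mean_ci(scores)
print(f"{mean:.4f} {sd:.4f} {lower:.4f} {upper:.4f}")
```

An interval built this way is not constrained to the range 0 to 1, which would be consistent with an upper bound such as the 1.0047 reported for DIAMOND precision in Table 6.18.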

Figure 7.2 presents the distribution of each statistical parameter per CAZyme classifier (including the recombined classifiers) for the differentiation of CAZymes and non-CAZymes.

Figure 7.2: Proportional area plot of the distribution of statistical parameters across all test sets.

7.1.1 Specificity

Specificity is the proportion of known negatives (known non-CAZymes) which are correctly classified as negatives (non-CAZymes).

Figure 7.3 is a graphical representation of the results in table 7.1.

Figure 7.3: One-dimensional scatter plot of specificity scores of CAZyme and non-CAZyme predictions per test set, overlaying box plot of standard deviation.

7.1.2 Sensitivity

Sensitivity (also known as recall) is the proportion of known positives (CAZymes) that are correctly identified as positives (CAZymes).

Figure 7.4 is a graphical representation of the results in table 7.1.

Figure 7.4: One-dimensional scatter plot of recall (sensitivity) scores of CAZyme and non-CAZyme predictions per test set, overlaying box plot of standard deviation.

7.1.3 Precision

Precision is the proportion of positive predictions by the classifiers that are correct.

In this case, precision represents the fraction of CAZyme predictions by the classifiers that are correct, specifically the proportion of predicted CAZymes that are known CAZymes.

Figure 7.5 is a visual representation of the results in table 7.1.

Figure 7.5: One-dimensional scatter plot of precision scores of CAZyme and non-CAZyme predictions per test set, overlaying box plot of standard deviation.

7.1.4 F1-score

The F1-score is the harmonic mean of recall and precision and provides an idea of the overall performance of the tool, with 0 being the worst and 1 being the best performance. Figure 7.6 shows the F1-score from each test set, for each classifier.

Figure 7.6: Bar chart of F1-score of CAZyme classifiers differentiation between CAZymes and non-CAZymes.

7.1.5 Accuracy

Accuracy (calculated using (TP + TN) / (TP + TN + FP + FN)) provides an idea of the overall performance of the classifiers as a measure of the degree to which their CAZyme/non-CAZyme predictions conform to the correct result. Figure 7.7 is a plot of the respective data from table 7.1.
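The five statistics defined in sections 7.1.1 to 7.1.5 can all be derived from the four confusion-matrix counts of one test set. The Python sketch below is illustrative only: the function name `binary_metrics` and the example counts are hypothetical, chosen to match a test set of 100 CAZymes and 100 non-CAZymes.

```python
def binary_metrics(tp, tn, fp, fn):
    """Compute the binary classification statistics from confusion counts.

    tp/fn count known CAZymes predicted as CAZyme/non-CAZyme;
    tn/fp count known non-CAZymes predicted as non-CAZyme/CAZyme.
    Undefined ratios (zero denominator) are returned as NaN.
    """
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else float("nan")

    specificity = ratio(tn, tn + fp)
    sensitivity = ratio(tp, tp + fn)  # also known as recall
    precision = ratio(tp, tp + fp)
    # Harmonic mean of precision and sensitivity
    f1 = ratio(2 * precision * sensitivity, precision + sensitivity)
    accuracy = ratio(tp + tn, tp + tn + fp + fn)
    return {
        "specificity": specificity,
        "sensitivity": sensitivity,
        "precision": precision,
        "f1": f1,
        "accuracy": accuracy,
    }


# Hypothetical test set: 100 CAZymes (90 found) and 100 non-CAZymes (2 miscalled)
stats = binary_metrics(tp=90, tn=98, fp=2, fn=10)
for name, value in stats.items():
    print(f"{name}: {value:.4f}")
```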

Figure 7.7: Bar chart of accuracy of CAZyme classifiers differentiation between CAZymes and non-CAZymes.

Below is a combined (3x2) plot of the above plots for evaluating the binary CAZyme/non-CAZyme classification performance of dbCAN and the user-defined combinations of tools. In this case:
- dbCAN
- HMMER, DIAMOND, CUPP
- HMMER, DIAMOND, eCAMI

7.2 Classification of CAZy classes

Table 7.2: Overall CAZy class classification performance of CAZyme classifiers
Classifier Spec Mean Spec Standard Deviation Spec Lower CI Spec Upper CI Sens Mean Sens Standard Deviation Sens Lower CI Sens Upper CI Prec Mean Prec Standard Deviation Prec Lower CI Prec Upper CI F1-score Mean F1-score Standard Deviation F1-score Lower CI F1-score Upper CI Acc Mean Acc Standard Deviation Acc Lower CI Acc Upper CI
CUPP 0.9975 0.0097 0.9964 0.9985 0.7118 0.3888 0.6711 0.7526 0.7695 0.4098 0.7265 0.8124 0.7343 0.3937 0.6930 0.7756 0.9554 0.0635 0.9487 0.9620
dbCAN 0.9960 0.0126 0.9947 0.9973 0.9016 0.1705 0.8837 0.9194 0.9624 0.1294 0.9488 0.9760 0.9218 0.1454 0.9065 0.9370 0.9779 0.0417 0.9735 0.9823
DIAMOND 0.9956 0.0130 0.9942 0.9969 0.9078 0.1960 0.8872 0.9283 0.9578 0.1526 0.9418 0.9738 0.9213 0.1725 0.9032 0.9394 0.9816 0.0426 0.9771 0.9861
eCAMI 0.9852 0.0324 0.9818 0.9886 0.8362 0.2157 0.8137 0.8588 0.8966 0.2066 0.8749 0.9182 0.8487 0.1950 0.8282 0.8691 0.9590 0.0536 0.9534 0.9646
HMMER 0.9966 0.0103 0.9955 0.9977 0.8270 0.2407 0.8017 0.8522 0.9612 0.1388 0.9466 0.9757 0.8675 0.2013 0.8464 0.8886 0.9686 0.0392 0.9645 0.9727
HMMER_DIAMOND_CUPP 0.9979 0.0091 0.9969 0.9988 0.8234 0.2637 0.7958 0.8511 0.9711 0.1416 0.9563 0.9860 0.8648 0.2235 0.8414 0.8883 0.9723 0.0404 0.9681 0.9766
HMMER_DIAMOND_eCAMI 0.9963 0.0121 0.9950 0.9975 0.9020 0.1851 0.8826 0.9214 0.9621 0.1345 0.9480 0.9762 0.9208 0.1580 0.9043 0.9374 0.9799 0.0416 0.9755 0.9842
Hotpep 0.9749 0.0471 0.9700 0.9799 0.8317 0.2120 0.8095 0.8540 0.8576 0.2495 0.8314 0.8837 0.8207 0.2116 0.7985 0.8429 0.9421 0.0673 0.9350 0.9491

Below, a proportional area plot of the F-beta score for each CAZyme classifier for each test set is generated. Each square is sized in proportion to the relative sample size. Not every CAZy class was present in every test set, so sample sizes differ between CAZy classes but are the same across classifiers.

A dataframe of the number of test sets containing each CAZy class is generated.

##   Prediction_tool GH GT PL CE AA CBM
## 1           dbCAN 70 70 38 67 37  70
## 2           HMMER 70 70 38 67 37  70
## 3         DIAMOND 70 70 38 67 37  70
## 4          Hotpep 70 70 38 67 37  70
## 5            CUPP 70 70 38 67 37  70
## 6           eCAMI 70 70 39 67 37  70
## 7           H_D_C 70 70 38 67 37  70
## 8           H_D_E 70 70 38 67 37  70
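A count of this kind can be produced by checking, for each test set, which CAZy classes appear among its proteins. The Python sketch below is a hypothetical illustration rather than the notebook's code: the function name `count_test_sets_per_class`, the data layout (a test set as a list of per-protein sets of CAZy classes) and the example values are assumptions.

```python
from collections import Counter

CAZY_CLASSES = ["GH", "GT", "PL", "CE", "AA", "CBM"]


def count_test_sets_per_class(test_sets):
    """Count how many test sets contain at least one member of each class.

    `test_sets` maps a test set name to a list of sets, one per protein,
    holding the CAZy classes annotated for that protein.
    """
    counts = Counter()
    for proteins in test_sets.values():
        present = set().union(*proteins) if proteins else set()
        counts.update(present & set(CAZY_CLASSES))  # one count per test set
    return {cazy_class: counts[cazy_class] for cazy_class in CAZY_CLASSES}


# Hypothetical test sets (illustrative values only)
test_sets = {
    "genome_A": [{"GH"}, {"GT", "CBM"}, set()],
    "genome_B": [{"GH"}, {"PL"}],
}
print(count_test_sets_per_class(test_sets))
```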
Figure 7.8: 95% confidence interval around the mean CAZy class classification per CAZy class

The sensitivity of each CAZyme classifier can be plotted against the specificity for each CAZy class. However, plotting all CAZy classes in a single plot produces an overly cramped plot unless very few test sets are used.

7.3 Performance per CAZy class

Below the prediction sensitivity is plotted against the specificity for each classifier, and a separate plot is generated for each CAZy class.

The scatter plots of sensitivity against specificity overlay a coloured contour to highlight the distribution of the points. When too many points share the same value, a contour cannot be generated, so noise is added to the data in order to plot one. The original data are used to plot the scatter plot, and the data with added noise are used to plot the contour.

The percentage of data points that need noise added in order to generate a contour varies from data set to data set. To change it, change the third argument in the call to the function plot.class.sens.vs.spec(), which is used to generate the plots. The third argument is the proportion of data points to add noise to, written in decimal form (for example, 0.5 for 50%).
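The idea of adding noise to only a fraction of the points can be sketched as below. This is a hypothetical Python illustration, not the implementation of plot.class.sens.vs.spec(): the function name `add_noise`, the `scale` and `seed` parameters and the uniform noise distribution are all assumptions.

```python
import random


def add_noise(points, fraction, scale=0.01, seed=0):
    """Return a copy of `points` with noise added to a fraction of them.

    `points` is a list of (specificity, sensitivity) tuples and `fraction`
    is the proportion of points to perturb, in decimal form. The input
    list is left unchanged so it can still be used for the scatter plot.
    """
    rng = random.Random(seed)
    n_noisy = round(fraction * len(points))
    noisy_indices = set(rng.sample(range(len(points)), n_noisy))
    noisy_points = []
    for i, (x, y) in enumerate(points):
        if i in noisy_indices:
            # Small uniform perturbation so identical points no longer coincide
            x += rng.uniform(-scale, scale)
            y += rng.uniform(-scale, scale)
        noisy_points.append((x, y))
    return noisy_points


# Ten identical points, as can happen when every test set scores 1.0
original = [(1.0, 1.0)] * 10
for_contour = add_noise(original, 0.5)
changed = sum(1 for a, b in zip(original, for_contour) if a != b)
print(changed, "of", len(original), "points perturbed")
```

Keeping the perturbed copy separate from the original mirrors the approach described above: the contour is drawn from the noisy data while the scatter points remain at their true values.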

7.3.0.1 CAZy class classification for GH

7.3.0.2 CAZy class classification for GT

7.3.0.3 CAZy class classification for PL

7.3.0.4 CAZy class classification for CE

7.3.0.5 CAZy class classification for AA

7.3.0.6 CAZy class classification for CBM

7.4 Rand Index and Adjusted Rand Index of CAZy Class Prediction

A single CAZyme can be included in multiple CAZy classes leading to the multilabel classification of CAZymes. To address this and evaluate the multilabel classification of CAZy classes the Rand Index (RI) and Adjusted Rand Index (ARI) were calculated.

The RI is a measure of accuracy across all potential classifications of a protein. The RI ranges from 0 (no correct annotations) to 1 (all annotations correct). The ARI is the RI adjusted for chance, where 0 is equivalent to assigning the CAZy class annotations randomly, -1 indicates that the annotations are systematically assigned incorrectly, and 1 indicates that all annotations are correct.
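As an illustration of how the two indices behave, the Python sketch below computes the standard pair-counting RI and ARI for a single protein. It assumes that a protein's annotations are encoded as a binary vector over the six CAZy classes and that the known and predicted vectors are compared; this encoding, the function names and the example vectors are assumptions, and the notebook may have computed the indices differently.

```python
from collections import Counter
from math import comb


def _pair_counts(true_labels, pred_labels):
    """Count item pairs grouped together in both, in truth, and in prediction."""
    total = comb(len(true_labels), 2)
    both = sum(comb(c, 2) for c in Counter(zip(true_labels, pred_labels)).values())
    same_true = sum(comb(c, 2) for c in Counter(true_labels).values())
    same_pred = sum(comb(c, 2) for c in Counter(pred_labels).values())
    return total, both, same_true, same_pred


def rand_index(true_labels, pred_labels):
    """Fraction of item pairs on which the two labelings agree."""
    total, both, same_true, same_pred = _pair_counts(true_labels, pred_labels)
    if total == 0:
        return 1.0
    apart_in_both = total - same_true - same_pred + both
    return (both + apart_in_both) / total


def adjusted_rand_index(true_labels, pred_labels):
    """Rand Index corrected for the agreement expected by chance."""
    total, both, same_true, same_pred = _pair_counts(true_labels, pred_labels)
    if total == 0:
        return 1.0
    expected = same_true * same_pred / total
    maximum = (same_true + same_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both vectors are constant
    return (both - expected) / (maximum - expected)


# Hypothetical protein; class order: GH, GT, PL, CE, AA, CBM
known = [1, 0, 0, 0, 0, 1]      # annotated as GH and CBM
predicted = [1, 0, 0, 0, 0, 0]  # the CBM annotation was missed
print(rand_index(known, predicted))
print(adjusted_rand_index(known, predicted))
```

In this example a single missed annotation lowers the ARI considerably more than the RI, which is consistent with the ARI tables showing lower means and larger standard deviations than the RI tables.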

Figure 7.9: 95% confidence interval around the mean of Rand Index (RI) of the performance of the CAZyme classifiers to predict the multilabel classification of CAZy classes.

Table 7.3: Adjusted Rand Index of CAZyme classifier classification of CAZy class annotations and 95% confidence interval
Prediction_tool Mean Standard Deviation Lower CI Upper CI
dbCAN 0.9455 0.2254 0.9418 0.9492
HMMER 0.9268 0.2537 0.9226 0.9310
DIAMOND 0.9545 0.2079 0.9510 0.9579
Hotpep 0.8706 0.3212 0.8653 0.8759
CUPP 0.9007 0.2852 0.8960 0.9054
eCAMI 0.9060 0.2836 0.9013 0.9107
HMMER_DIAMOND_CUPP 0.9355 0.2392 0.9316 0.9395
HMMER_DIAMOND_eCAMI 0.9505 0.2155 0.9470 0.9541
Figure 7.10: 95% confidence interval around the mean of Adjusted Rand Index (ARI) of the performance of the CAZyme classifiers to predict the multilabel classification of CAZy classes.

The following plots are violin plots underlying scatter plots, presenting the RI and ARI for every protein across all test sets.

Figure 7.11: Violin plot of Rand Index (RI) of performance of the CAZyme classifiers to predict the multilabel classification of CAZy classes.

Figure 7.12: Violin plot of Adjusted Rand Index (ARI) of performance of the CAZyme classifiers to predict the multilabel classification of CAZy classes.

7.5 Classification of CAZy families

The following section evaluates the performance of combining CAZyme classifiers when predicting CAZy family classifications, comparing the performance of the user-defined combinations of classifiers with that of the individual classifiers.

7.6 Performance per CAZy family

To evaluate the performance of predicting each CAZy family independently of all other CAZy families, the sensitivity and precision for each CAZy family, for each CAZyme classifier, were calculated and plotted against each other (Fig.??). Whereas sensitivity was plotted directly against specificity for CAZy classes, the variation in specificity scores for CAZy families was extremely small, so sensitivity was plotted as a percentage against log10 of the specificity percentage.

Later in this report the sensitivity for each CAZy family is plotted against specificity, as was done for CAZy classes. However, the differences in specificity are extremely small, with no tool producing a specificity below 0.995, which makes it extremely difficult to separate performance by specificity, so a box plot and scatter plot is plotted for each. Each point represents one test set, and test sets are grouped by CAZyme classifier and facet wrapped by the parent CAZy class.

Figure 7.14: 95% confidence interval around the mean of CAZy family classification.

Figure 7.15: 95% confidence interval around the mean CAZy family classifier per CAZy class

For better resolution, the CAZy families can be grouped by their parent CAZy class and the performances of the tools compared CAZy class by CAZy class. Owing to the minimal variation in specificity scores, specificity was plotted as log10 of the percentage specificity.

7.6.1 Glycoside Hydrolases

Figure 7.16 shows the plotting of sensitivity against specificity for each Glycoside Hydrolase CAZy family.

Figure 7.16: Scatter plot of recall (sensitivity) against specificity for predicting each CAZy family for each CAZyme classifier in the CAZy class Glycoside Hydrolases. Each GH CAZy family is represented as a single point on the plot.

7.6.2 Glycosyltransferases

Figure 7.17 shows the plotting of sensitivity against specificity for each Glycosyltransferases CAZy family.

Figure 7.17: Scatter plot of recall (sensitivity) against specificity for predicting each CAZy family for each CAZyme classifier in the CAZy class Glycosyltransferases. Each GT CAZy family is represented as a single point on the plot.

7.6.3 Polysaccharide Lyases

Figure 7.18 shows the plotting of sensitivity against specificity for each Polysaccharide Lyases CAZy family.

Figure 7.18: Scatter plot of recall (sensitivity) against specificity for predicting each CAZy family for each CAZyme classifier in the CAZy class Polysaccharide Lyases. Each PL CAZy family is represented as a single point on the plot.

7.6.4 Carbohydrate Esterases

Figure 7.19 shows the plotting of sensitivity against specificity for each Carbohydrate Esterases CAZy family.

Figure 7.19: Scatter plot of recall (sensitivity) against specificity for predicting each CAZy family for each CAZyme classifier in the CAZy class Carbohydrate Esterases. Each CE CAZy family is represented as a single point on the plot.

7.6.5 Auxiliary Activities

Figure 7.20 shows the plotting of sensitivity against specificity for each Auxiliary Activities CAZy family.

Figure 7.20: Scatter plot of recall (sensitivity) against specificity for predicting each CAZy family for each CAZyme classifier in the CAZy class Auxiliary Activities. Each AA CAZy family is represented as a single point on the plot.

7.6.6 Carbohydrate Binding Modules

Figure 7.21 shows the plotting of sensitivity against specificity for each Carbohydrate Binding Module CAZy family.

Figure 7.21: Scatter plot of recall (sensitivity) against specificity for predicting each CAZy family for each CAZyme classifier in the CAZy class Carbohydrate Binding Modules. Each CBM CAZy family is represented as a single point on the plot.

7.7 Rand Index and Adjusted Rand Index of CAZy Family Classifications

Table 7.5: Rand Index of CAZyme classifier classification of CAZy family annotations
Prediction_tool Mean Standard Deviation
dbCAN 0.9997 0.0011
HMMER 0.9996 0.0014
DIAMOND 0.9998 0.0010
Hotpep 0.9991 0.0023
CUPP 0.9995 0.0015
eCAMI 0.9994 0.0017
Figure 7.22: 95% confidence interval around the mean of Rand Index (RI) of the performance of the CAZyme classifiers to predict the multilabel classification of CAZy classes.

Table 7.6: Adjusted Rand Index of CAZyme classifier classification of CAZy family annotations
Prediction_tool Mean Standard Deviation
dbCAN 0.9391 0.2359
HMMER 0.9250 0.2554
DIAMOND 0.9530 0.2105
Hotpep 0.8758 0.3083
CUPP 0.9098 0.2712
eCAMI 0.9077 0.2778

8 Conclusions

Overall, all CAZyme classifiers showed strong performances at all three levels of CAZyme classification (CAZyme/non-CAZyme, CAZy class and CAZy family).

Performance was extremely strong for all CAZyme classifiers across all levels of CAZyme classification; performance varied most between classifiers for sensitivity.

In general, the CAZyme/non-CAZyme, CAZy class and CAZy family classifications were accurate for all CAZyme classifiers (i.e. when a classification was predicted it was frequently correct). However, the CAZyme classifiers do not predict a comprehensive CAZome. The performance of the CAZyme classifiers differed most in sensitivity, which indicates a non-comprehensive annotation of the CAZome, CAZy class members and CAZy family members.

Classifying bacterial or eukaryotic proteins had negligible impact on the performance of the CAZyme classification at every level of classification (CAZyme/non-CAZyme, CAZy class and CAZy family).